How to subdivide documents into sentences before training Mallet LDA

Question

Do you have any suggestions for how I can subdivide documents into sentences before training MALLET LDA?

Thanks in advance

Answer 1

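Depending on your definition of a sentence, this can be done in Java using String.split("\\.\\s"). This assumes the writer ends a sentence with a period and starts the next one with whitespace. The period is escaped because the argument to split is a regular expression, and each backslash is doubled because it sits inside a Java string literal. \\s means "any whitespace", so this also handles line endings and tabs.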

String test = "Hello. World. Cat eats dog.";
String[] splitString = test.split("\\.\\s");

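The contents of splitString are now {"Hello", "World", "Cat eats dog."}. Note that the last period is not removed, because it is not followed by whitespace. You can now write the sentences to files. You can use a BufferedWriter to do this: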

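import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

try {
    String filename = "test";
    int i = 0;
    for (String sentence : splitString) {
        File file = new File(filename + i + ".txt");
        // createNewFile() returns false if the file already exists
        // (you can use this to prevent overwriting).
        file.createNewFile();
        // try-with-resources closes the writer, which also flushes it;
        // without closing, the sentence may never reach the disk.
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
            writer.append(sentence + "\n");
        }
        i++;
    }
} catch (IOException ioexception) {
    System.out.println(ioexception.getMessage());
    System.exit(1);
}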

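This writes the split sentences to new files, one sentence per file. Be careful, though: this can cause space problems on FAT32-formatted systems, because such file systems allocate space in whole clusters, which can be as large as 32 kB. A file then occupies 32 kB on the drive no matter how small it is (a file may be 8 kB but still take up 32 kB). This may be somewhat impractical, but it does work. Now you just run import-dir on the directory containing all these files and use them in LDA. You can also read part of a tutorial here: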

https://programminghistorian.org/lessons/topic-modeling-and-mallet#getting-your-own-texts-into-mallet

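For larger inputs (roughly 5000 sentences and up [producing at least 160 MB of data]) I suggest you still do the split, but instead of writing to multiple files, write to a single file and implement your own way of importing the data using the MALLET API. See http://mallet.cs.umass.edu/import-devel.php for a developer's guide and http://mallet.cs.umass.edu/api/ for more information.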
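
As a rough illustration of the single-file idea, here is a minimal sketch in Python (chosen for brevity, not because the answer uses it). It applies the same period-plus-whitespace rule and writes one sentence per line as name, label, text separated by tabs, which is the one-instance-per-line layout that MALLET's import-file command reads by default. The function names, the "sent" prefix and the placeholder label "X" are made up for this example.

```python
import re

def split_sentences(text):
    """Split on a period followed by one or more whitespace characters."""
    parts = re.split(r"\.\s+", text.strip())
    return [p for p in parts if p]

def to_mallet_lines(sentences, prefix="sent", label="X"):
    """Format each sentence as 'name<TAB>label<TAB>text', one instance per line."""
    lines = []
    for i, sentence in enumerate(sentences):
        # Collapse internal whitespace so every instance stays on a single line.
        flat = " ".join(sentence.split())
        lines.append(f"{prefix}{i}\t{label}\t{flat}")
    return lines

def write_sentence_file(text, path):
    """Write all sentences of `text` to one file and return how many were written."""
    lines = to_mallet_lines(split_sentences(text))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Like the Java version, this rule will also split after abbreviations such as "e.g. ", so it is only a starting point.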

Answer 2

For example, you can use the OpenNLP sentence detection tools. They have been around for a while and perform decently in most cases.

The documentation is here, the models can be downloaded here. Note that the version 1.5 models are fully compatible with the newer opennlp-tools version 1.8.4.

If you are using Maven, simply add the following to your pom.xml file.

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>

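If you plan to switch the model input from documents to sentences, be aware that vanilla LDA (this also affects the current implementation in Mallet, afaik) might not produce satisfying results, because word co-occurrence counts are not very pronounced within single sentences.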

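I would suggest investigating whether the paragraph level is more interesting. Paragraphs in a document can be extracted using a line-break pattern. For example, a new paragraph starts when you have two consecutive line breaks.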
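
The two-line-break rule can be sketched as follows. This is a minimal Python example, assuming plain-text input in which a blank line separates paragraphs; split_paragraphs is a hypothetical helper name, not part of Mallet or OpenNLP.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs at two or more consecutive line breaks."""
    # Normalise Windows and old Mac line endings first.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (which may contain spaces or tabs) separates paragraphs.
    chunks = re.split(r"\n\s*\n", normalised)
    # Join wrapped lines inside each paragraph and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

Each returned paragraph can then be treated as one LDA document instead of one sentence.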

Answer 3

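These functions will prepare the documents to be passed into LDA. I would also consider setting up a bow_corpus, because LDA takes numbers rather than sentences. For example, the word "going" is stemmed to "go", then numbered/indexed as, say, 2343, and counted by frequency; if it shows up twice, the bow_corpus entry will be (2343, 2), which is what LDA expects.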

# Gensim: unsupervised topic modeling, natural language processing, statistical machine learning
import gensim
# convert a document to a list of tokens
from gensim.utils import simple_preprocess
# remove stopwords - words that carry little meaning: "it", "I", "the", "and", etc.
from gensim.parsing.preprocessing import STOPWORDS
# corpus and dictionary utilities, topic models
from gensim import corpora, models

# nltk - Natural Language Toolkit
# lemmatized: words in third person are changed to first person and verbs in past and future tenses are changed
# into present.
# stemmed: words are reduced to their root form.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Create functions to lemmatize, stem, and preprocess

# lemmatize as a verb, then stem, e.g. "beautiful" becomes the stem "beauti"
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse a document into individual words, ignoring words of 3 letters or fewer
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the processed tokens to a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


# send the comments column through the preprocessing step
# map applies the function to every row
# (documents is assumed to be a pandas DataFrame; use your own column name)

processed_docs = documents['Your Comments title header'].map(preprocess)
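
To make the (id, count) representation concrete, here is a small plain-Python sketch. It is a stand-in written for illustration, not gensim's implementation; in gensim, corpora.Dictionary and its doc2bow method play this role. The helper names build_vocabulary and to_bow and the toy documents are assumptions.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def to_bow(tokens, vocabulary):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

# Toy documents standing in for the output of preprocess().
docs = [["go", "store", "go"], ["store", "close"]]
vocab = build_vocabulary(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
```

Here "go" receives id 0 and occurs twice in the first document, so that document's bag of words starts with the pair (0, 2).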


mallet

lda