Speed up annotation time in CoreNLP sentiment

Question

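My dataset contains 100,000 text files, and I am trying to process them with CoreNLP. The desired result is 100,000 output text files in which every sentence is classified as having positive, negative, or neutral sentiment. To get from one text file to the other, I use the CoreNLP jar file with the command line below.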

 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -fileList list.txt

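This takes a very long time to finish, because I cannot get the model to read each file named in the file list. Instead, it treats each path line itself as the input to the model.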

I also tried some of the other approaches described at this link, but I could not get the desired result from them. https://stanfordnlp.github.io/CoreNLP/cmdline.html#classpath

Is there a better, faster way to do this and speed up the process?

Answer 1

Try this command:

java -Xmx14g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,parse,sentiment -parse.model edu/stanford/nlp/models/srparser/englishSR.ser.gz -outputFormat text -filelist list.txt

This uses the faster shift-reduce parser. It iterates over every file in list.txt (one file path per line) and processes each one.
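Since `-filelist` expects one file path per line, the list itself has to be generated for 100,000 files. The following is a minimal sketch of one way to do that using only the Java standard library. The class name `FileListBuilder`, the default input directory `texts`, and the `.txt` filter are assumptions for illustration and are not part of CoreNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of CoreNLP): builds the list.txt file
// that is passed to the -filelist option.
public class FileListBuilder {

    // Collects every regular *.txt file directly inside inputDir (non-recursive)
    // and writes the absolute paths to listFile, one per line, in sorted order.
    // Keep listFile outside inputDir so it is not picked up on a later run.
    static List<String> writeFileList(Path inputDir, Path listFile) throws IOException {
        List<String> paths;
        try (Stream<Path> entries = Files.list(inputDir)) {
            paths = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
        Files.write(listFile, paths, StandardCharsets.UTF_8);
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: input files live in ./texts, list goes to ./list.txt
        Path inputDir = Paths.get(args.length > 0 ? args[0] : "texts");
        Path listFile = Paths.get(args.length > 1 ? args[1] : "list.txt");
        List<String> paths = writeFileList(inputDir, listFile);
        System.out.println("Wrote " + paths.size() + " paths to " + listFile);
    }
}
```

Compiled with `javac FileListBuilder.java` and run as `java FileListBuilder texts list.txt`, this should produce a list.txt that can be passed to the command above.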

java

command-line

nlp

stanford-nlp

sentiment-analysis