Efficient batch processing with Stanford CoreNLP

Question

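Is it possible to speed up batch processing of documents with CoreNLP from the command line so that the models are loaded only once? I would like to trim any unnecessarily repeated steps from the process.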

I have 320,000 text files that I am trying to process with CoreNLP. The desired result is 320,000 finished XML result files.

To get from one text file to one XML file, I use the CoreNLP jar files from the command line:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory MyOutput -outputExtension .xml -replaceExtension

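This loads the models and does a variety of machine learning magic. The problem I am facing is that when I loop over every text file in a directory, I end up with a process that, by my estimate, will take 44 days to complete. A command prompt has been looping away on my desktop for the last 7 days, and I am still nowhere near finished. This is the loop I run from the batch script: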

for %%f in (Data\*.txt) do (
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory Output -outputExtension .xml -replaceExtension
)

I am using these annotators, specified in config.properties:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment

Answer 1

I know nothing about Stanford CoreNLP, so I googled it (you did not include any link), and on this page I found this description (under "Parsing a file and saving the output as XML"):

If you want to process a list of files use the following command line:

java -cp stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR CONFIGURATION FILE ] -filelist A FILE CONTAINING YOUR LIST OF FILES

where the -filelist parameter points to a file whose content lists all files to be processed (one per line).

So I think you could process the files faster if you store the names of all the text files in a list file:

dir /B *.txt > list.lst

... and then pass that list via the -filelist list.lst parameter in a single execution of Stanford CoreNLP.
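
As a rough sketch of the same idea, here is how the list file and the single command could be put together in Python rather than in a batch file. Note that dir /B prints bare file names without their directory, so this sketch writes absolute paths instead. The helper names write_file_list and corenlp_command are made up for illustration. Combining -filelist with the output flags from the question is an assumption (the quoted documentation only shows -filelist together with -props), and so is the UTF-8 encoding of the list file.

```python
from pathlib import Path


def write_file_list(input_dir, list_path, pattern="*.txt"):
    """Write one absolute path per line for each file in input_dir matching pattern.

    Hypothetical helper; returns the number of files listed.
    """
    files = sorted(p.resolve() for p in Path(input_dir).glob(pattern) if p.is_file())
    # Assumption: the list file is read as UTF-8 text, one path per line.
    Path(list_path).write_text("".join(f"{p}\n" for p in files), encoding="utf-8")
    return len(files)


def corenlp_command(list_path, props="config.properties", output_dir="Output"):
    """Build one CoreNLP invocation from the flags that appear in this thread.

    Hypothetical helper. Assumption: -filelist can be combined with the
    output flags used in the question's per-file command.
    """
    return [
        "java", "-cp", "*", "-Xmx2g",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", props,
        "-filelist", str(list_path),
        "-outputDirectory", output_dir,
        "-outputExtension", ".xml",
        "-replaceExtension",
    ]
```

To use it, call write_file_list("Data", "list.lst") and then pass corenlp_command("list.lst") to subprocess.run(..., check=True) from the directory that contains the CoreNLP jars, so that Java starts once for the whole list instead of once per file.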
