Stanford CoreNLP output is very slow in Python

Question

I am using NLTK's StanfordDependencyParser to generate dependency trees. Here is the code:

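import os

cpath = "path to stanford-corenlp-4.2.0-models-english.jar" + os.pathsep + "path to stanford-parser.jar"

# Prepend the JARs to CLASSPATH unless they are already on it.
# os.environ.get avoids a KeyError when CLASSPATH is not set at all.
classpath = os.environ.get('CLASSPATH', '')
if cpath not in classpath:
    os.environ['CLASSPATH'] = os.pathsep.join(p for p in (cpath, classpath) if p)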

# TODO: DEPRECATED
# self.dependency_parser_instance_corenlp = StanfordDependencyParser(path_to_models_jar="path to stanford-corenlp-4.2.0-models-english.jar", encoding='utf8')

dependencies = [list(parse.triples()) for parse in self.dependency_parser_instance_corenlp.raw_parse(query)]

# Intended to encode every string in the tree to UTF-8 so string matching works.
# Note: str.encode() returns a new object, and the results below are discarded.
for dependency in dependencies[0]:
    dependency[0][0].encode('utf-8')
    dependency[0][1].encode('utf-8')
    dependency[1].encode('utf-8')
    dependency[2][0].encode('utf-8')
    dependency[2][1].encode('utf-8')

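For a 10-word sentence, it takes about 1.5 seconds to produce the output. Is this expected? Can you suggest steps to improve the speed? I have already tried using the SR parser and removing all the extra folders (such as coref, lexparser, ner, tagger) from the models JAR.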

Answer 1

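Yes, StanfordDependencyParser (the old NLTK implementation) is very slow, and you should not use it. Use the classes and methods in the nltk.parse.corenlp module instead; they are much faster.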
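As far as I understand, most of the gain comes from the fact that the nltk.parse.corenlp classes talk over HTTP to a CoreNLP server that stays running, instead of starting Java again for each parse. Here is a minimal sketch of the replacement. It assumes you have already started a CoreNLP server yourself and that it listens on http://localhost:9000; the helper names make_parser and dependency_triples are mine, not part of NLTK.

```python
def make_parser(url="http://localhost:9000"):
    """Create a client for an already running CoreNLP server (URL is an assumption)."""
    # Imported here so the helpers below can be used without NLTK installed.
    from nltk.parse.corenlp import CoreNLPDependencyParser
    return CoreNLPDependencyParser(url=url)


def dependency_triples(parser, sentence):
    """Return the (governor, relation, dependent) triples of the first parse."""
    # raw_parse yields one dependency graph per parse; take the first one.
    graph = next(iter(parser.raw_parse(sentence)))
    return list(graph.triples())
```

The server has to be started separately from the CoreNLP directory, for example with java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000. After that, dependency_triples(make_parser(), query) should give the same kind of triples as the list comprehension in the question. Create the parser once and reuse it for every query.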

For details, see the notes in the CoreNLP documentation or this tutorial.
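If you want to check whether a change actually helps with your 1.5 seconds per sentence, a rough timing helper like the one below is enough. It is only a sketch: average_latency is a hypothetical name, and parse_fn stands for whatever callable wraps your parser. The warm-up calls are left out of the measurement because the first request usually pays for loading the models.

```python
import time


def average_latency(parse_fn, sentences, warmup=1):
    """Return the average seconds per call of parse_fn over sentences."""
    # Warm-up calls are not timed, so one-off model loading does not skew the result.
    for sentence in sentences[:warmup]:
        parse_fn(sentence)
    start = time.perf_counter()
    for sentence in sentences:
        parse_fn(sentence)
    return (time.perf_counter() - start) / len(sentences)
```

Run it with the same list of sentences against the old and the new parser and compare the two numbers.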
