How to speed up NE recognition with Stanford NER with Python NLTK

Question

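First I tokenize the file content into sentences and then call Stanford NER on each sentence. This process is really slow. I know it would be faster to call it on the whole file content, but I call it per sentence because I want to index each sentence before and after NE recognition.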

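from nltk import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger  # renamed StanfordNERTagger in later NLTK releases

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER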

This is probably because st.tag() is called once for every sentence, but is there any way to make it run faster?

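EDIT

The reason I want to tag sentences separately is that I want to write them to a file (like a sentence index), so that given an NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also doing lemmatization here).

File format: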

(sent_number, orig_sentence, NE_and_lemmatized_sentence)

Answer 1

StanfordNERTagger has a tag_sents() function that tags a whole list of sentences in one call; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

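>>> from nltk import sent_tokenize, word_tokenize
>>> from nltk.tag.stanford import StanfordNERTagger
>>> st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> # tag_sents() expects one list of tokens per sentence
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
>>> st.tag_sents(tokenized_sents)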
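To keep the per-sentence index the question asks for while still making a single batched call, the tagged output can be zipped back onto the original sentences. This is only a minimal sketch: it assumes tag_sents() returns one tagged list per input sentence, in the same order. build_sentence_index and stub_tag_sents are hypothetical names made up for this illustration; the stub stands in for the real tagger so the snippet runs without Java.

```python
from typing import Callable, List, Sequence, Tuple

Tagged = List[Tuple[str, str]]


def build_sentence_index(
    sentences: Sequence[str],
    tokenize: Callable[[str], List[str]],
    tag_sents: Callable[[List[List[str]]], List[Tagged]],
) -> List[Tuple[int, str, Tagged]]:
    """Tag all sentences in one batched call and keep each sentence's index."""
    tokenized = [tokenize(sent) for sent in sentences]
    tagged = tag_sents(tokenized)  # one call instead of one call per sentence
    if len(tagged) != len(sentences):
        raise ValueError("tagger returned a different number of sentences")
    return [(i, sent, tags) for i, (sent, tags) in enumerate(zip(sentences, tagged))]


def stub_tag_sents(token_lists: List[List[str]]) -> List[Tagged]:
    # Hypothetical stand-in for st.tag_sents: title-case tokens become PERSON.
    return [[(tok, "PERSON" if tok.istitle() else "O") for tok in toks]
            for toks in token_lists]


index = build_sentence_index(["Alice went home .", "it rained ."],
                             str.split, stub_tag_sents)
for row in index:
    print(row)
```

With the real tagger you would presumably pass word_tokenize and st.tag_sents instead of str.split and the stub; each row then has the (sent_number, orig_sentence, tagged_sentence) shape from the question.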

Answer 2

First download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml

Let's say you put the download at /Users/username/stanford-corenlp-full-2015-04-20

This Python code will run the pipeline:

import os

stanford_distribution_dir = "/Users/username/stanford-corenlp-full-2015-04-20"
list_of_sentences_path = "/Users/username/list_of_sentences.txt"
stanford_command = "cd %s ; java -Xmx2g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -filelist %s -outputFormat json" % (stanford_distribution_dir, list_of_sentences_path)
os.system(stanford_command)

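Here is some sample Python code for loading one of the resulting .json files, for reference:

import json
with open("sample_file.txt.json") as f:
    sample_json = json.load(f)

At this point sample_json will be a nice dictionary containing all the sentences from the file.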

for sentence in sample_json["sentences"]:
  tokens = []
  ner_tags = []
  for token in sentence["tokens"]:
    tokens.append(token["word"])
    ner_tags.append(token["ner"])
  print (tokens, ner_tags)

list_of_sentences.txt should be your list of files containing sentences, something like:

input_file_1.txt
input_file_2.txt
...
input_file_100.txt

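So input_file.txt (which should have one sentence per line) will generate input_file.txt.json once the Java command has run, and that .json file will contain the NER tags. You can just load the .json for each input file and easily get (sentence, NER tag sequence) pairs. You can experiment with "text" as an alternative output format if you like that better. But "json" creates a nice .json file that you can load with the json module, giving you a dictionary you can use to access the sentences and annotations.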

This way you only load the pipeline once for all of the files.
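As a rough sketch of how the JSON output could be turned into the (sent_number, orig_sentence, NE_and_lemmatized_sentence) rows from the question: the function name corenlp_json_to_rows and the "lemma/NER" rendering are my own choices for illustration, and the sample below is a hand-written stand-in shaped like the output, not real CoreNLP output. It assumes each token carries "word", "lemma" and "ner" keys, which should be the case when the lemma and ner annotators are enabled as in the command above.

```python
import json
from typing import List, Tuple


def corenlp_json_to_rows(json_text: str) -> List[Tuple[int, str, str]]:
    """Return (sent_number, orig_sentence, NE_and_lemmatized_sentence) rows."""
    doc = json.loads(json_text)
    rows = []
    for sent_number, sentence in enumerate(doc["sentences"]):
        tokens = sentence["tokens"]
        # The "original" sentence here is the tokens re-joined with spaces.
        original = " ".join(tok["word"] for tok in tokens)
        # Assumed "lemma/NER" rendering; use whatever suits your index file.
        processed = " ".join(f'{tok["lemma"]}/{tok["ner"]}' for tok in tokens)
        rows.append((sent_number, original, processed))
    return rows


# Hand-written sample in the shape of the JSON output (not real tool output).
sample = json.dumps({
    "sentences": [
        {"tokens": [
            {"word": "Alice", "lemma": "Alice", "ner": "PERSON"},
            {"word": "sings", "lemma": "sing", "ner": "O"},
        ]}
    ]
})
rows = corenlp_json_to_rows(sample)
print(rows)
```

Because the input has one sentence per line, you could also take the original sentence straight from the matching line of the input file rather than re-joining tokens.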

Answer 3

You can use the Stanford NER server. It will be much faster, because the classifier is loaded once and stays in memory.

Install sner

pip install sner

Run the NER server

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost',port=9199)
print(tagger.get_entities(test_string))

The result of this code is

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]

See https://github.com/caihaoyu/sner for more details.
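The output above tags every token separately, so "Museum of Natural History" comes back as four ORGANIZATION tokens. If you want whole phrases, one simple approach is to join runs of consecutive tokens that share a tag. This is a sketch of my own, not part of sner; merge_adjacent_tags is a hypothetical helper, and it will also merge two different entities of the same type if they happen to sit next to each other.

```python
from itertools import groupby
from typing import List, Tuple


def merge_adjacent_tags(tagged: List[Tuple[str, str]],
                        outside: str = "O") -> List[Tuple[str, str]]:
    """Join runs of consecutive tokens that share a non-O tag into one phrase."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities


tagged = [('Alice', 'PERSON'), ('went', 'O'), ('to', 'O'), ('the', 'O'),
          ('Museum', 'ORGANIZATION'), ('of', 'ORGANIZATION'),
          ('Natural', 'ORGANIZATION'), ('History', 'ORGANIZATION'), ('.', 'O')]
print(merge_adjacent_tags(tagged))
# [('Alice', 'PERSON'), ('Museum of Natural History', 'ORGANIZATION')]
```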

Answer 4

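After trying several options, I like Stanza. It is developed by Stanford, is very simple to set up, I didn't have to figure out how to start a server properly on my own, and it dramatically improved the speed of my program. It recognizes 18 different entity classes.

I found Stanza while searching the documentation.

To install: pip install stanza

Then in Python: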

import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences

If you only want a specific tool (NER), you can specify it with processors: nlp = stanza.Pipeline('en',processors='tokenize,ner')

For output similar to what the OP produces:

classified_text = [(token.text,token.ner) for i, sentence in enumerate(doc.sentences) for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]

But to produce a list containing only the words that are part of a recognized entity:

classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]

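It has a few features that I really like:

You can access each sentence through doc.sentences.
It merges John Doe into a single 'PERSON' entity instead of classifying each word as a separate person entity.
If you do need each individual word, you can extract the words and see which part of the entity each one is ('B' for the first word of the entity, 'I' for the middle words, 'E' for the last word of the entity).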
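If you start from the per-token tags and want to rebuild the merged entities yourself, the B/I/E prefixes can be decoded with a small loop. This is only a sketch with a hypothetical helper name (bioes_to_entities); it also handles an 'S' prefix, which as far as I know Stanza uses for single-word entities even though it isn't mentioned above. In practice doc.ents already gives you this result.

```python
from typing import List, Tuple


def bioes_to_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse per-token B-/I-/E-/S-/O tags into (text, type) entities."""
    entities, words, current = [], [], None
    for word, tag in tagged:
        prefix, _, label = tag.partition("-")
        if prefix == "S":  # single-word entity
            entities.append((word, label))
            words, current = [], None
        elif prefix == "B":  # first word of a new entity
            words, current = [word], label
        elif prefix in ("I", "E") and current == label:
            words.append(word)
            if prefix == "E":  # last word: the entity is complete
                entities.append((" ".join(words), label))
                words, current = [], None
        else:  # "O", or a tag that does not continue the open entity
            words, current = [], None
    return entities


classified_text = [('My', 'O'), ('name', 'O'), ('is', 'O'),
                   ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
print(bioes_to_entities(classified_text))
# [('John Doe', 'PERSON')]
```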
