NLP

Question

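I am doing natural language processing (NLP) in Python 3, more specifically named entity recognition (NER) on the Harry Potter series of books. I am using StanfordNER, which works well but takes an enormous amount of time...

I did some research online into why it would be this slow, but I can't seem to find anything that really fits my code, and honestly I think the problem lies more in the (bad) way I wrote the code.

So here is what I have written so far: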

import string
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk.tag.stanford as st

tagger = st.StanfordNERTagger('_path_/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz', '_path_/stanford-ner-2017-06-09/stanford-ner.jar')

#this is just to read the file

hp = open("books/hp1.txt", 'r', encoding='utf8')
lhp = hp.readlines()

#a small function I wrote to divide the book in sentences

def get_sentences(lbook):
    sentences = []
    for k in lbook:
        j = sent_tokenize(k)
        for i in j:
            if bool(i):
                sentences.append(i)
    return sentences

#a function to divide a sentence into words

def get_words(sentence):
    words = word_tokenize(sentence)
    return words

sentences = get_sentences(lhp)

#and now the code I wrote to get all the words labeled as PERSON by the StanfordNER tagger

characters = []
for sentence in sentences:
    # one tagger call per sentence (this is the slow part)
    persons = [tag[0] for tag in tagger.tag(get_words(sentence)) if tag[1] == "PERSON"]
    characters.extend(persons)
    print(persons)

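As I explained, the problem is that the code takes a huge amount of time... So I am wondering: is this normal, or can I save time by rewriting the code in a better way? If so, could you help me?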

Answer 1

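The bottleneck is the tagger.tag method, which has a large overhead per call. Calling it once for every sentence therefore makes the program very slow. Unless there is some additional need to split the book into sentences, I would process the whole text at once: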

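# read the whole book as one string (same encoding as in the question)
with open('books/hp1.txt', 'r', encoding='utf8') as content_file:
    all_text = content_file.read()

# a single tagger call for the entire text
tags = tagger.tag(word_tokenize(all_text))
characters = [tag[0] for tag in tags if tag[1] == "PERSON"]
print(characters)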

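Now, if what you want to know is, say, in which sentence each character is mentioned, you could first collect the character names in characters as in the code above, and then loop over the sentences and check whether each one contains an element of characters.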
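A minimal sketch of that second step, assuming characters and sentences already exist as in the code above. The helper name sentences_mentioning is made up for illustration, and the matching is a plain whitespace split with surrounding punctuation stripped, so forms such as possessives would need extra handling.

```python
import string
from collections import defaultdict


def sentences_mentioning(characters, sentences):
    """Map each character name to the indices of the sentences that contain it."""
    names = set(characters)  # the tagger output contains repeated names
    mentions = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        # split on whitespace and strip surrounding punctuation from each token
        tokens = {tok.strip(string.punctuation) for tok in sentence.split()}
        for name in names & tokens:
            mentions[name].append(idx)
    return dict(mentions)


# Hand-written example data, so no tagger is needed to try it out
example_sentences = ["Harry looked at Ron.", "Hermione sighed.", "Ron laughed at Harry!"]
example_characters = ["Harry", "Ron", "Hermione", "Harry"]
print(sentences_mentioning(example_characters, example_sentences))
```

This keeps the expensive tagging to a single call and does the per-sentence lookup in plain Python, which is cheap by comparison.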

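If file size is a concern (although the .txt file of most books should not be a problem to load into memory), then instead of processing the whole book you can process n sentences at a time. Starting from your code, modify your for loop like this: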

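n = 1000
characters = []
for i in range(0, len(sentences), n):
    # join n sentences into one chunk so the tagger is called once per chunk
    # (the sentences already end with their own punctuation)
    scs = ' '.join(sentences[i:i + n])
    characters.extend(tag[0] for tag in tagger.tag(get_words(scs)) if tag[1] == "PERSON")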

The general idea is to minimize the number of calls to tagger.tag, since each call carries a large overhead.
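If the sentence boundaries are still needed, one possible way to keep the number of calls low is to pass all tokenized sentences in a single batch. NLTK's Stanford taggers provide a tag_sents method that takes a list of token lists and returns one tagged list per sentence. The sketch below assumes that interface. The names persons_per_sentence and StubTagger are hypothetical, and the stub exists only so the example runs without Java or the Stanford jar.

```python
def persons_per_sentence(tagger, tokenized_sentences):
    """Tag all sentences in one call and return the PERSON tokens of each sentence."""
    tagged_sentences = tagger.tag_sents(tokenized_sentences)  # single batched call
    return [
        [word for word, label in tagged if label == "PERSON"]
        for tagged in tagged_sentences
    ]


class StubTagger:
    """Hypothetical stand-in for StanfordNERTagger (assumption: same tag_sents shape)."""

    KNOWN_PERSONS = {"Harry", "Ron"}

    def __init__(self):
        self.calls = 0  # counts how often the (normally expensive) call happens

    def tag_sents(self, sentences):
        self.calls += 1
        return [
            [(tok, "PERSON" if tok in self.KNOWN_PERSONS else "O") for tok in sent]
            for sent in sentences
        ]


stub = StubTagger()
tokens = [["Harry", "met", "Ron", "."], ["It", "rained", "."]]
print(persons_per_sentence(stub, tokens))
print("tagger calls:", stub.calls)
```

With the real tagger from the question, the call would look like persons_per_sentence(tagger, [get_words(s) for s in sentences]).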

NLP - Speed of Named Entity Recognition (StanfordNER)

python

named-entity-recognition

stanford-nlp

python-3.x