Named Entity Recognition Python

What I want to do: extract every occurrence of n consecutive words that all start with a capital letter.

Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]

Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]

This is what I have come up with so far:

# create text file
fw = open("ngram.txt", "w")
fw.write ("Does John Doe eat pizza in New York?")
fw.close()

def UpperCaseNGrams (file,n):
    fr = open (file, "r")
    text = fr.read().split()

    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]  
    return ngramlist

print (UpperCaseNGrams("ngram.txt",2))

I get the following error:
TypeError: 'int' object is not subscriptable

What do I have to change to make this work?

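In word+n[0].isupper(), both word and n are ints, so they cannot be indexed with []; in other words, integers are not subscriptable. The same applies to word[0], because word is an index produced by range(), not a string.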
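The error can be reproduced in isolation. This is only a small sketch; the values of word and n are arbitrary stand-ins for the loop index and the n-gram length.

```python
# word is an index from range(), n is the n-gram length: both are ints.
word, n = 0, 2

try:
    # n[0] is evaluated first and fails, because an int cannot be indexed.
    word + n[0].isupper()
except TypeError as exc:
    message = str(exc)

print(message)
```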

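I assume your intent was to check that the nth word after the current word starts with a capital letter; however, that would be done with text[word+n][0]. In any case, I don't think your approach will work for values of n other than 2. For example, if n is 3, you need to check that every word between the current word and the nth word after it is capitalized as well.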

The simplest fix is to use all() to check that every word in each sublist starts with a capital letter:

ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
                 if all(s[0].isupper() for s in text[word:word+n])]
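Applied to the function from the question, the fix might look like the sketch below. The name upper_case_ngrams_from_file is made up for illustration, and the with blocks, the temporary directory and the conversion to tuples are additions that are not in the original code.

```python
import os
import tempfile

def upper_case_ngrams_from_file(path, n):
    # Hypothetical helper: read whitespace-separated words from a text file.
    with open(path, "r") as fr:
        text = fr.read().split()

    # Keep a window of n words only if every word in it starts with a capital.
    return [tuple(text[i:i+n]) for i in range(len(text) - (n - 1))
            if all(s[0].isupper() for s in text[i:i+n])]

# Write the sample sentence to a temporary file, then extract the bigrams.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ngram.txt")
    with open(path, "w") as fw:
        fw.write("Does John Doe eat pizza in New York?")
    result = upper_case_ngrams_from_file(path, 2)

print(result)
```

Note that the last word keeps its question mark ('York?'), because split() only splits on whitespace.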

If you want something a bit faster, you can group runs of capitalized words together like this:

from itertools import groupby

text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)

This will output

[['Does', 'John', 'Doe'], ['New', 'York?']]

Now you need to extract the sublists of length n from each run:

ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))

The resulting ngrams is:

[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]

And for n = 3:

[['Does', 'John', 'Doe']]

Putting all of this together (and converting the ngrams accumulator into a list comprehension) gives a function like this:

from itertools import groupby

def upper_case_ngrams(words, n):
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in caps_words
                for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))

Output

[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
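The output above still contains 'York?', whereas the question asks for "York". One possible way to close that gap, sketched here under the assumption that leading and trailing punctuation should simply be dropped, is to strip each word with string.punctuation before grouping. The name upper_case_ngrams_clean is hypothetical.

```python
import string
from itertools import groupby

def upper_case_ngrams_clean(words, n):
    # Assumption: punctuation at either end of a word is not part of the word.
    cleaned = [w.strip(string.punctuation) for w in words]
    # w[:1] avoids an IndexError for words that were pure punctuation and are
    # now empty; ''.isupper() is False, so such words break a run.
    runs = [list(v) for g, v in groupby(cleaned, key=lambda w: w[:1].isupper()) if g]
    return [tuple(run[i:i+n]) for run in runs
            for i in range(len(run) - (n - 1))]

text = 'Does John Doe eat pizza in New York?'.split()
bigrams = upper_case_ngrams_clean(text, 2)
trigrams = upper_case_ngrams_clean(text, 3)
print(bigrams)
print(trigrams)
```

One caveat with this sketch: once punctuation is stripped, a run can continue across a sentence boundary (for example "York. The"), so real text may need sentence splitting first.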