Named Entity Recognition Python
What I want to do: extract every occurrence of n consecutive words that all start with a capital letter.
Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]
Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]
This is what I have come up with so far:
# create text file
fw = open("ngram.txt", "w")
fw.write("Does John Doe eat pizza in New York?")
fw.close()

def UpperCaseNGrams(file, n):
    fr = open(file, "r")
    text = fr.read().split()
    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]
    return ngramlist

print(UpperCaseNGrams("ngram.txt", 2))
I get the following error:
TypeError: 'int' object is not subscriptable
What do I have to change to make this work?
In word+n[0].isupper(), word and n are both of type int, so they cannot be indexed with []. In other words, integers are not subscriptable.
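The error is easy to reproduce in isolation. In the short sketch below (the variable values are arbitrary and chosen only for illustration), the subscript binds more tightly than +, so Python evaluates n[0] first and fails before any addition happens:

```python
word, n = 0, 2  # arbitrary illustrative values; both are plain ints

try:
    word + n[0]  # parsed as word + (n[0]), so n[0] is evaluated first
except TypeError as exc:
    message = str(exc)

print(message)  # 'int' object is not subscriptable
```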
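I think your intent was to check whether the nth word after the current word starts with an uppercase letter, but that would be done with text[word+n][0] (and the last word of an n-word window starting at word is actually text[word+n-1]). In any case, I don't think your approach works for values of n other than 2: if n is 3, for example, you need to check that every word from the current word through the end of the window is capitalized, not just the first and the last.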
The simplest fix is to use all() to check that every word in each sublist starts with an uppercase letter:
ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
             if all(s[0].isupper() for s in text[word:word+n])]
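As a sketch, the all() check can be wrapped in a complete function that takes the sentence as a string instead of reading it from a file. The name upper_case_ngrams_all is made up for this illustration and is not part of the question's code:

```python
def upper_case_ngrams_all(sentence, n):
    """Return every window of n consecutive words that all start uppercase."""
    text = sentence.split()  # split() never yields empty strings, so s[0] is safe
    return [text[i:i+n] for i in range(len(text) - (n - 1))
            if all(s[0].isupper() for s in text[i:i+n])]

result = upper_case_ngrams_all("Does John Doe eat pizza in New York?", 2)
print(result)  # [['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]
```

Each window is sliced and checked in full, so a lowercase word in the middle of a longer window is caught as well.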
If you want something a bit faster, you can group runs of capitalized words together like this:
from itertools import groupby
text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)
This outputs:
[['Does', 'John', 'Doe'], ['New', 'York?']]
Now you need to extract the sublists of length n from each run:
ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))
The resulting ngrams list is:
[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]
And with n = 3:
[['Does', 'John', 'Doe']]
Putting all of this together (and converting the ngram accumulator into a list comprehension) yields a function like this:
from itertools import groupby

def upper_case_ngrams(words, n):
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in caps_words
            for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))
Output:
[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
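Note that these results contain 'York?' with the trailing question mark, whereas the expected output in the question shows "York". One possible way to handle this, sketched below, is to strip surrounding punctuation from each word before grouping. The function name upper_case_ngrams_clean is hypothetical, and using string.punctuation (ASCII punctuation only) is an assumption about what should count as punctuation:

```python
from itertools import groupby
import string

def upper_case_ngrams_clean(sentence, n):
    """Like upper_case_ngrams, but strips leading/trailing ASCII punctuation first."""
    # Strip punctuation from both ends of each word; drop words that become empty.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]
    # Group consecutive words by whether they start with an uppercase letter.
    runs = [list(v) for g, v in groupby(words, key=lambda w: w[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in runs
            for i in range(len(run) - (n - 1))]

cleaned = upper_case_ngrams_clean("Does John Doe eat pizza in New York?", 2)
print(cleaned)  # [('Does', 'John'), ('John', 'Doe'), ('New', 'York')]
```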