How to pick out personal names after filtering proper nouns in NLP

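Background

I want to know how to separate ordinary personal names from the other proper nouns after extracting proper nouns with NLP.

Preferred output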
['Hanna', 'Mike', 'Cathy', 'Tom']

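Problem

I can extract proper nouns with an NLP library such as spaCy, but the result also contains place names and other words that are not personal names.

Output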

['Hawaii', 'Hanna', 'Mike', 'Barbacoa', 'Mexico', 'Cathy', 'Tom']

Code

import spacy
nlp = spacy.load("en_core_web_sm")

ppnouns = []

texts = [
"Mike, Tom, Cathy agreed; it was a magnificent evening.",

"Mike hopes that, when he's built up my savings, he'll be able to travel to Mexico to eat Barbacoa.",

"Of all the places to travel, Hawaii is at the top of Tom's list.",

"Would you like to travel with Hanna?"
]

# Extract proper nouns (tokens tagged PROPN) from every sentence
for text in texts:
    for word in nlp(text):
        if word.pos_ == 'PROPN':
            ppnouns.append(word.text)

print(list(set(ppnouns)))

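The sentences come from the following page: https://examples.yourdictionary.com/reference/examples/examples-of-complete-sentences.html

I edited the example sentences above for my code.

What I tried

I tried using WordNet, a large lexical database of English, to look up a category for each word, but it did not return personal names or any category I could use to tell them apart.

My current input and output are small, but I plan to process much larger input, so I did not build a dictionary myself like the one below.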

dic = {'given_names': ['Jack', 'Mike', 'Mary', 'Cathy', 'Tom', 'Jessica', 'Megan', 'Hanna'],
       'family_names': ['Smith', 'Miller', 'Lopez', 'Williams', 'Johnson']}

How can I solve this? Is there a solution or tool that does what I am trying to do?

From WordNet Search 3.1:

#input
Hanna
#output
Your search did not return any results.

#input
Tom
#output
S: (n) tom, tomcat (male cat)
S: (n) turkey cock, gobbler, tom, tom turkey (male turkey)

Environment

Python 3.8

What you want is to extract the named entities labeled "PERSON". With current spaCy you can do it like this:

import spacy
nlp = spacy.load("en_core_web_sm")

texts = [
"Mike, Tom, Cathy agreed; it was a magnificent evening.",
"Mike hopes that, when he's built up my savings, he'll be able to travel to Mexico to eat Barbacoa.",
"Of all the places to travel, Hawaii is at the top of Tom's list.",
"Would you like to travel with Hanna?"
]

docs = nlp.pipe(texts)

names = []
for doc in docs:
    names.extend([ent for ent in doc.ents if ent.label_=="PERSON"])
print(names)

Output

[Mike, Tom, Cathy, Mike, Tom]

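Note that Hanna is missing from the list, which means spaCy's statistical model did not recognize it as a name here. If you want a deterministic result, it is best to define a dictionary of the names you want to select.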
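
The dictionary approach could look roughly like the sketch below. It is only an illustration under some assumptions: `KNOWN_GIVEN_NAMES` and `filter_person_names` are hypothetical names that do not come from spaCy, the name set is a tiny hand-made sample that would need to be replaced by a much larger list for real input, and the candidate list is simply the PROPN output from the question pasted in by hand.

```python
# Hypothetical gazetteer of given names (an assumption for this sketch);
# for real input it would be loaded from a much larger name list.
KNOWN_GIVEN_NAMES = {
    "Jack", "Mike", "Mary", "Cathy", "Tom", "Jessica", "Megan", "Hanna",
}


def filter_person_names(candidates, known_names=KNOWN_GIVEN_NAMES):
    """Keep only candidates found in known_names.

    Matching is case-insensitive; input order is preserved and
    duplicates are dropped.
    """
    lookup = {name.casefold() for name in known_names}
    seen = set()
    result = []
    for word in candidates:
        key = word.casefold()
        if key in lookup and key not in seen:
            seen.add(key)
            result.append(word)
    return result


# Proper nouns as produced by the PROPN filter in the question.
ppnouns = ["Hawaii", "Hanna", "Mike", "Barbacoa", "Mexico", "Cathy", "Tom"]
print(filter_person_names(ppnouns))
# prints ['Hanna', 'Mike', 'Cathy', 'Tom']
```

A lookup like this is deterministic, so Hanna is kept as long as it is in the set, but it will miss any name that is not listed and will also accept words that only happen to coincide with a name. Taking the union of this result and the "PERSON" entities found by spaCy is one possible way to trade precision for recall.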