Best way to extract first and last names from a sentence in Python (Persian text)

I have more than 20,000 first names and last names, and I want to check whether a sentence contains a first name and a last name from my dataset. This is my dataset:

l-name   f-name  
میلاد  جورابلو
علی    احمدی
امیر    احمدی

Here is a sample sentence:

sentence = 'امروز با میلاد احمدی رفتم بیرون'

English version of the dataset:

l-name    f-name
Smith     John
Johnson   Anthony
Williams  Ethan

Here is the English version of the sentence:

sentence = 'I am going out with John Williams today'

I want my output to look like this:

first_name = ['John']
last_name = ['Williams']

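If you want to solve this the naive way, you could use a regular expression, but that relies on the assumption that every first and last name is capitalized, so it cannot work for Persian script, which has no letter case.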

import re

sentence = 'I am going out with John Williams today'
match = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence)
name = match.group() if match else None  # None when no capitalized pair is found
print(name)  # Outputs: John Williams

This searches for one uppercase letter followed by one or more lowercase letters, then a space, then the same pattern again.

Beyond that, you could consider named entity recognition (NER), using a pre-built library to identify names in text. See here for more details: https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/

Edit:

I should add that if the same sentence contains several names, you can use re.findall():

import re

sentence = 'I am going out with John Williams and William Smith today'
names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence)
print(names)  # Outputs: ['John Williams', 'William Smith']
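To get separate first-name and last-name lists, as the question asks, one option is to put capture groups in the same pattern; with two groups, re.findall() returns a list of tuples. This is only a sketch under the same capitalization assumption, and the helper name split_names is made up for illustration.

```python
import re

# Same pattern as above, with one capture group per word.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")


def split_names(sentence):
    """Return (first_names, last_names) for every capitalized word pair."""
    pairs = NAME_PATTERN.findall(sentence)  # list of (first, last) tuples
    first_names = [first for first, _ in pairs]
    last_names = [last for _, last in pairs]
    return first_names, last_names


first_name, last_name = split_names(
    "I am going out with John Williams and William Smith today"
)
print(first_name)  # ['John', 'William']
print(last_name)   # ['Williams', 'Smith']
```

Note that any two adjacent capitalized words will match, whether or not they are names, so the result would still need to be checked against the dataset.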

Simply get the list of names from each column and check whether the sentence contains any element of those lists.

import pandas as pd

names = [['John', 'Smith'], ['Anthony', 'Johnson'], ['Ethan', 'Williams']]
df = pd.DataFrame(names, columns=['f_name', 'l_name'])

# Sets give constant-time membership checks, which helps with 20,000+ names.
fname_set = set(df['f_name'])
lname_set = set(df['l_name'])

sentence = 'I am going out with John Williams today'
words = sentence.split()

# Keep the words that appear in each name column, in sentence order.
fname_exist = [w for w in words if w in fname_set]
lname_exist = [w for w in words if w in lname_set]

if fname_exist and lname_exist:
    print('first name: ' + fname_exist[0])
    print('last name: ' + lname_exist[0])

Output:

first name: John
last name: Williams
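Because this lookup does not depend on capitalization, it should carry over to the Persian sample as well. A possible variation, sketched below with plain sets instead of pandas, only reports a first name when the very next word is a known last name, which avoids pairing words from different parts of the sentence. The function name find_names is hypothetical, and it is assumed here that میلاد, علی and امیر are the first names in the Persian dataset and جورابلو and احمدی the last names.

```python
def find_names(sentence, first_names, last_names):
    """Return (first_name, last_name) lists for adjacent first/last name words."""
    tokens = sentence.split()
    found_first, found_last = [], []
    # Walk over each word together with the word that follows it.
    for current, following in zip(tokens, tokens[1:]):
        if current in first_names and following in last_names:
            found_first.append(current)
            found_last.append(following)
    return found_first, found_last


# Assumed column assignment for the sample datasets in the question.
first_names = {"میلاد", "علی", "امیر", "John", "Anthony", "Ethan"}
last_names = {"جورابلو", "احمدی", "Smith", "Johnson", "Williams"}

print(find_names("امروز با میلاد احمدی رفتم بیرون", first_names, last_names))
print(find_names("I am going out with John Williams today", first_names, last_names))
```

This relies on whitespace tokenization, so punctuation attached to a word and names made of more than one word would be missed; real Persian text would probably need normalization before the lookup.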