Best way to extract the first name and last name from a sentence in Python (Persian text)
I have more than 20,000 first names and last names, and I want to check whether a sentence contains a first name or last name from my dataset. This is my dataset:
l-name f-name
میلاد جورابلو
علی احمدی
امیر احمدی
This is a sample sentence:
sentence = 'امروز با میلاد احمدی رفتم بیرون'
English version of the dataset:
l-name f-name
Smith John
Johnson Anthony
Williams Ethan
This is the English version of the sentence:
sentence = 'I am going out with John Williams today'
I want my output to look like this:
first_name = ['John']
last_name = ['Williams']
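If you want to solve this in a naive way, you could consider a regular expression, but this relies on the assumption that every first name and last name is capitalized.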
import re

sentence = 'I am going out with John Williams today'
name = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence).group()
print(name)  # Outputs: John Williams
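This searches for an uppercase letter followed by one or more lowercase letters, then a space, then a repeat of the same pattern.
Beyond that, you could consider named entity recognition (NER), using a prebuilt library to identify names in text. See here for more details: https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/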
Edit:
I'd add that if there are multiple names in the same sentence, you can use re.findall():
sentence = 'I am going out with John Williams and William Smith today'
names = re.findall(r"[A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+", sentence)
print(names) # Outputs: ['John Williams', 'William Smith']
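To get the first_name and last_name lists the question asks for, the same pattern can be given two capture groups. This is only a rough sketch: the function name split_names and the constant NAME_PATTERN are mine, not from the question. It also shows the limitation for the Persian sample: Persian script has no upper or lower case, so a capitalization-based pattern finds nothing there.

```python
import re

# Same idea as above, with one capture group per name part.
# NAME_PATTERN and split_names are hypothetical names used for illustration.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")

def split_names(sentence):
    """Return (first_names, last_names) for every 'Capitalized Capitalized' pair."""
    pairs = NAME_PATTERN.findall(sentence)  # list of (first, last) tuples
    first_names = [first for first, _ in pairs]
    last_names = [last for _, last in pairs]
    return first_names, last_names

first_name, last_name = split_names(
    'I am going out with John Williams and William Smith today'
)
print(first_name)  # ['John', 'William']
print(last_name)   # ['Williams', 'Smith']

# The Persian sample has no capital letters, so nothing matches.
print(split_names('امروز با میلاد احمدی رفتم بیرون'))  # ([], [])
```

For the Persian text you would need a lookup against the dataset or an NER model rather than this pattern.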
Just get the list of names from each column and check whether the sentence contains any element of those lists.
import pandas as pd

names = [['John', 'Smith'], ['Anthony', 'Johnson'], ['Ethan', 'Williams']]
df = pd.DataFrame(names, columns=['f_name', 'l_name'])
fname_list = df['f_name'].to_list()
lname_list = df['l_name'].to_list()

sentence = 'I am going out with John Williams today'
words = sentence.split()
# Keep only the words that appear in the name columns
fname_exist = [e for e in words if e in fname_list]
lname_exist = [e for e in words if e in lname_list]
if fname_exist and lname_exist:
    print('first name: ' + fname_exist[0])
    print('last name: ' + lname_exist[0])
Output:
first name: John
last name: Williams
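With 20,000+ names, checking each word against a list gets slow, so it may be worth putting the names in sets. Here is a small sketch of the same lookup that also runs on the Persian sample. The function name find_names is mine, and I am assuming that میلاد, علی and امیر are the first names and جورابلو and احمدی the last names in the question's dataset.

```python
def find_names(sentence, first_names, last_names):
    """Return the words of `sentence` found in each name set (hypothetical helper)."""
    tokens = sentence.split()  # plain whitespace split; works for Persian too
    found_first = [t for t in tokens if t in first_names]
    found_last = [t for t in tokens if t in last_names]
    return found_first, found_last

# Assumed column assignment for the Persian dataset in the question.
persian_first = {'میلاد', 'علی', 'امیر'}
persian_last = {'جورابلو', 'احمدی'}

first_name, last_name = find_names(
    'امروز با میلاد احمدی رفتم بیرون', persian_first, persian_last
)
print(first_name)  # ['میلاد']
print(last_name)   # ['احمدی']

# The same function on the English sample.
english_first = {'John', 'Anthony', 'Ethan'}
english_last = {'Smith', 'Johnson', 'Williams'}
print(find_names('I am going out with John Williams today',
                 english_first, english_last))  # (['John'], ['Williams'])
```

Note that this only matches whole whitespace-separated words, so a name with punctuation attached (for example "Williams,") or a multi-word name would be missed unless you clean the tokens first.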