Matching IDs to a varied set of names
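I have a dataset containing a list of company names along with their respective IDs. Each company appears several times, and some of those appearances are spelled differently. Every company has at least one instance with an ID, but because of the inconsistent spelling, not every instance has one. All instances of the same company are grouped together. The data looks like this: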
company_name id
T. Rowe Price Group
Group, T. Rowe Price 576
T. ROWE PRICE GROUP
Transatlantic, Inc 458
Transatlantic, Incorporated
Transatlantic, Inc 458
Is there a good way to match the company names that are missing an ID to the correct ID?
Here is one way to do it using pandas:
import re
from collections import OrderedDict

import pandas as pd

# Split a string into its text and number parts, e.g. 'Name 576' -> ['Name ', '576']
def my_splitter(s):
    # list() is needed on Python 3, where filter() returns an iterator
    return list(filter(None, re.split(r'(\d+)', s)))

# Read the file as a single-column dataframe, skipping the header row
df = pd.read_csv('dataset.txt', sep='\t', header=None, skiprows=1, names=['Name'])

join = []
for raw in df['Name']:
    parts = my_splitter(raw)
    if len(parts) != 2:
        join.append({'ID': 'na', 'Name': parts[0].strip()})
    else:
        join.append({'ID': parts[1], 'Name': parts[0].strip()})
df_new = pd.DataFrame(join, columns=['ID', 'Name'])

# Build a dictionary that maps each known ID to the words of its company name
diction = OrderedDict()
for i in range(len(df_new)):
    if df_new.loc[i, 'ID'] != 'na':
        diction[df_new.loc[i, 'ID']] = df_new.loc[i, 'Name'].split()

# Give each row without an ID the ID of a known name that shares at least one word with it
for i in range(len(df_new)):
    if df_new.loc[i, 'ID'] == 'na':
        for j in diction:
            if set(df_new.loc[i, 'Name'].split()) & set(diction[j]):
                # .loc avoids the chained assignment df_new['ID'][i] = j
                df_new.loc[i, 'ID'] = j

print(df)  # contents of the input file read as a dataframe
print("####################")
print(df_new)

# Save the result to dataset.txt (note: this overwrites the input file)
df_new.to_csv('dataset.txt', sep='\t')
Output:
Name
0 T. Rowe Price Group
1 Group, T. Rowe Price 576
2 T. ROWE PRICE GROUP
3 Transatlantic, Inc 458
4 Transatlantic, Incorporated
5 Transatlantic, Inc 458
####################
ID Name
0 576 T. Rowe Price Group
1 576 Group, T. Rowe Price
2 576 T. ROWE PRICE GROUP
3 458 Transatlantic, Inc
4 458 Transatlantic, Incorporated
5 458 Transatlantic, Inc
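The matching rule in the answer above is "two names belong together if they share at least one word". Below is a minimal sketch of that rule in isolation, assuming lower-cased, punctuation-free tokens (a small variation on the plain `split()` used above). The helper names `tokens` and `shares_token` and the company "Acme Group" are hypothetical, introduced only for illustration.

```python
import re

def tokens(name):
    """Lower-case a name and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def shares_token(a, b):
    """Return True if the two names have at least one word in common."""
    return bool(tokens(a) & tokens(b))

# Names from the question
print(shares_token("T. ROWE PRICE GROUP", "Group, T. Rowe Price"))
print(shares_token("Transatlantic, Incorporated", "Transatlantic, Inc"))
print(shares_token("T. Rowe Price Group", "Transatlantic, Inc"))

# Hypothetical name: a single generic word such as "Group" is enough to match
print(shares_token("T. Rowe Price Group", "Acme Group"))
```

The last call suggests why this rule is probably only safe when, as in the question, the rows of one company are grouped together and generic words rarely collide.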
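Using NLTK, you can reduce the company names to their root forms (see the stemming and lemmatization examples at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), and then assign the same ID to every name that belongs to the same company.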
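The idea of reducing names to a common form can be sketched without NLTK. The example below is not stemming: it swaps in a simpler, standard-library normalization (lower-casing, dropping punctuation, mapping a few suffix variants, and sorting the words) to build a comparison key. The `SUFFIXES` table and the functions `name_key` and `fill_ids` are assumptions made for illustration, not part of any library.

```python
import re

# Hypothetical table mapping suffix variants to one canonical spelling.
SUFFIXES = {"incorporated": "inc", "corporation": "corp", "limited": "ltd"}

def name_key(name):
    """Build a key that ignores case, punctuation, word order and suffix variants."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return tuple(sorted(SUFFIXES.get(w, w) for w in words))

def fill_ids(rows):
    """rows is a list of (name, id_or_None); missing IDs are looked up by key."""
    known = {name_key(n): i for n, i in rows if i is not None}
    return [(n, i if i is not None else known.get(name_key(n))) for n, i in rows]

# Sample rows from the question
rows = [
    ("T. Rowe Price Group", None),
    ("Group, T. Rowe Price", "576"),
    ("T. ROWE PRICE GROUP", None),
    ("Transatlantic, Inc", "458"),
    ("Transatlantic, Incorporated", None),
    ("Transatlantic, Inc", "458"),
]

filled = fill_ids(rows)
for name, company_id in filled:
    print(company_id, name)
```

Names whose key matches no known ID are left as `None`, so they can be reviewed by hand or passed to a fuzzier method.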