Matching IDs to a varied set of names

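I have a dataset containing a list of company names and their respective IDs. Each company appears multiple times, and some of the instances are written differently. For every company, at least one instance has an ID, but because of inconsistent spelling not all of them do. All instances of a company are grouped together. The data looks like this: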

company_name                 id

T. Rowe Price Group
Group, T. Rowe Price         576
T. ROWE PRICE GROUP
Transatlantic, Inc           458
Transatlantic, Incorporated
Transatlantic, Inc           458

Is there a good way to match the company names that are missing an ID to the correct ID?

Here is one way to do it using pandas:

import re
from collections import OrderedDict

import pandas as pd

# Split a string into its text part and its number part (if any).
# A list is returned so that len() and indexing also work on Python 3,
# where filter() yields a lazy iterator.
def my_splitter(s):
    return [part for part in re.split(r'(\d+)', s) if part]

# Read the file as a single-column dataframe, skipping the header line
df = pd.read_csv('dataset.txt', sep='\t', header=None, skiprows=1, names=['Name'])

join = []
for raw in df['Name']:
    parts = my_splitter(raw)
    if len(parts) != 2:
        join.append({'Name': parts[0], 'ID': 'na'})
    else:
        join.append({'Name': parts[0], 'ID': parts[1]})
# Fix the column order explicitly: ID first, then Name
df_new = pd.DataFrame(join, columns=['ID', 'Name'])

# Map each known ID to the words of a company name that carries it
diction = OrderedDict()
for i in range(len(df_new)):
    if df_new['ID'][i] != 'na':
        diction[df_new['ID'][i]] = df_new['Name'][i].split()

# Give each row without an ID the ID of a known name sharing at least one word
for i in range(len(df_new)):
    if df_new['ID'][i] == 'na':
        for j in diction:
            if set(df_new['Name'][i].split()) & set(diction[j]):
                df_new.loc[i, 'ID'] = j  # .loc avoids chained assignment

print(df)  # contents of the input file read as a dataframe
print("####################")
print(df_new)
# Save the result to a file (note: this overwrites the input file dataset.txt)
df_new.to_csv('dataset.txt', sep='\t')

Output:

                              Name
0               T. Rowe Price Group
1  Group, T. Rowe Price         576
2               T. ROWE PRICE GROUP
3  Transatlantic, Inc           458
4       Transatlantic, Incorporated
5  Transatlantic, Inc           458
####################
    ID                           Name
0  576            T. Rowe Price Group
1  576  Group, T. Rowe Price         
2  576            T. ROWE PRICE GROUP
3  458  Transatlantic, Inc           
4  458    Transatlantic, Incorporated
5  458  Transatlantic, Inc   
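
One caveat with the word-overlap test above: a single shared word is enough to trigger a match, and the comparison is sensitive to case and punctuation. Below is a minimal sketch of a stricter variant, assuming the rows are already available as (name, ID) pairs with None for a missing ID. The helper names tokens and fill_missing_ids are made up for illustration, and the Jaccard score is a swapped-in similarity measure, not something from the code above.

```python
import re

def tokens(name):
    """Lowercase a name and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def fill_missing_ids(rows):
    """rows: list of (name, id_or_None). Returns a new list with gaps filled."""
    known = [(tokens(name), cid) for name, cid in rows if cid is not None]
    filled = []
    for name, cid in rows:
        if cid is None:
            words = tokens(name)
            best_score, best_id = 0.0, None
            for known_words, known_id in known:
                union = words | known_words
                # Jaccard similarity: shared words divided by all words
                score = len(words & known_words) / len(union) if union else 0.0
                if score > best_score:
                    best_score, best_id = score, known_id
            cid = best_id  # stays None when no known name shares a word
        filled.append((name, cid))
    return filled

rows = [
    ("T. Rowe Price Group", None),
    ("Group, T. Rowe Price", "576"),
    ("T. ROWE PRICE GROUP", None),
    ("Transatlantic, Inc", "458"),
    ("Transatlantic, Incorporated", None),
    ("Transatlantic, Inc", "458"),
]
result = fill_missing_ids(rows)
for name, cid in result:
    print(cid, name)
```

Picking the highest-scoring candidate rather than any candidate should make it less likely that a generic word such as "Inc" or "Group" attaches a name to the wrong company, though on real data you would probably also want a minimum score threshold.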

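Using NLTK, you can reduce the company names to their root forms (see the stemming and lemmatization examples at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), and then assign the same ID to all names that belong to the same company.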
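
A rough sketch of that normalize-then-group idea follows. It deliberately does not call NLTK, so that it runs with the standard library only: the ALIASES table is a hypothetical stand-in for a real stemmer or lemmatizer, and canonical_key and propagate_ids are illustrative names, not part of any library.

```python
import re

# Hypothetical alias table standing in for a real stemmer or lemmatizer
ALIASES = {"incorporated": "inc", "corporation": "corp", "limited": "ltd"}

def canonical_key(name, stem=lambda word: word):
    """Reduce a name to an order-independent tuple of normalized words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return tuple(sorted(stem(ALIASES.get(w, w)) for w in words))

def propagate_ids(rows):
    """rows: list of (name, id_or_None). Names sharing a key share an ID."""
    ids_by_key = {}
    for name, cid in rows:
        if cid is not None:
            # Keep the first ID seen for each canonical key
            ids_by_key.setdefault(canonical_key(name), cid)
    return [
        (name, cid if cid is not None else ids_by_key.get(canonical_key(name)))
        for name, cid in rows
    ]

sample = [
    ("T. Rowe Price Group", None),
    ("Group, T. Rowe Price", "576"),
    ("T. ROWE PRICE GROUP", None),
    ("Transatlantic, Inc", "458"),
    ("Transatlantic, Incorporated", None),
    ("Transatlantic, Inc", "458"),
]
grouped = propagate_ids(sample)
for name, cid in grouped:
    print(cid, name)
```

If NLTK is installed, passing stem=PorterStemmer().stem (from nltk.stem) into canonical_key would apply real stemming to each word. Note that a stemmer alone is unlikely to map "Incorporated" to "Inc", so some alias handling for legal suffixes would probably still be needed.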