Matching IDs to a varied set of names

Question

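I have a dataset containing a list of company names along with their respective IDs. Each company appears multiple times, and some of the instances are written differently. Every company has at least one instance with an ID, but because of inconsistent spelling, not all instances have one. All instances of a company are grouped together. The data looks like this: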

company_name                 id

T. Rowe Price Group
Group, T. Rowe Price         576
T. ROWE PRICE GROUP
Transatlantic, Inc           458
Transatlantic, Incorporated
Transatlantic, Inc           458

Is there a good way to match the company names that are missing an ID to the correct ID?

Answer 1

Here is one way to do it using pandas:

import pandas as pd
import re
from collections import OrderedDict

# Split a string into its text and number parts, dropping empty pieces.
# Returns a list so that len() and indexing work on Python 3,
# where filter() returns an iterator.
def my_splitter(s):
    return [part for part in re.split(r'(\d+)', s) if part]

# Read the file as a one-column dataframe, skipping the header row
df = pd.read_csv('dataset.txt', sep='\t', header=None, skiprows=1, names=['Name'])
join = []
for name in df['Name']:
    parts = my_splitter(name)
    if len(parts) != 2:
        # no trailing number on this line, so the ID is missing
        join.append({'Name': parts[0], 'ID': 'na'})
    else:
        join.append({'Name': parts[0], 'ID': parts[1]})
df_new = pd.DataFrame(join, columns=['ID', 'Name'])

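diction = OrderedDict()
# Build a dictionary that maps each known ID to the words of its company name
for i in range(len(df_new)):
    if df_new['ID'][i] != 'na':
        diction[df_new['ID'][i]] = df_new['Name'][i].split()

# For each row without an ID, take the ID of a company whose name shares at least one word
for i in range(len(df_new)):
    if df_new['ID'][i] == 'na':
        for j in diction:
            if set(df_new['Name'][i].split()) & set(diction[j]):
                # .loc avoids chained assignment, which may not write back to df_new
                df_new.loc[i, 'ID'] = j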

print (df) # contents of the testing file read as a dataframe
print ("####################")
print (df_new)
#save result to a file - dataset.txt
df_new.to_csv('dataset.txt', sep='\t')

Output:

                              Name
0               T. Rowe Price Group
1  Group, T. Rowe Price         576
2               T. ROWE PRICE GROUP
3  Transatlantic, Inc           458
4       Transatlantic, Incorporated
5  Transatlantic, Inc           458
####################
    ID                           Name
0  576            T. Rowe Price Group
1  576  Group, T. Rowe Price         
2  576            T. ROWE PRICE GROUP
3  458  Transatlantic, Inc           
4  458    Transatlantic, Incorporated
5  458  Transatlantic, Inc
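
One caveat with the word-overlap test above: a single shared word is enough to count as a match, and the comparison is sensitive to case and punctuation. A stricter variant could score each candidate and keep only the best one. The following is a minimal sketch of that idea using Jaccard similarity on lowercased words, with only the standard library. The helper names (tokens, jaccard, best_id) and the 0.25 cutoff are assumptions made for illustration, not part of the code above.

```python
import re

def tokens(name):
    """Lowercased set of words, with punctuation removed."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_id(name, known, min_score=0.25):
    """Return the ID whose name is most similar to `name`, or None.

    known maps ID -> company name; min_score is an arbitrary cutoff.
    """
    scored = [(jaccard(tokens(name), tokens(other)), id_)
              for id_, other in known.items()]
    score, id_ = max(scored, default=(0.0, None))
    return id_ if score >= min_score else None

known = {"576": "Group, T. Rowe Price", "458": "Transatlantic, Inc"}
print(best_id("T. ROWE PRICE GROUP", known))
print(best_id("Transatlantic, Incorporated", known))
print(best_id("Acme Group", known))
```

The cutoff would need tuning on real data: too low and generic words such as "Group" tie unrelated companies together, too high and abbreviated names stop matching.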

Answer 2

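Using NLTK, you can reduce the company names to their root forms (see the stemming and lemmatization examples at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), and then assign the same ID to every name that reduces to the same root.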
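
The idea of reducing names to a common form can be tried out without NLTK first. The sketch below is not stemming proper: it swaps in a crude normalization (lowercasing, stripping punctuation, sorting the words, and a small hand-written abbreviation map) and then copies each known ID to every name with the same key. The ABBREVIATIONS table and the function names are hypothetical and would need extending for real data.

```python
import re

# Hypothetical abbreviation map: a crude stand-in for real stemming.
ABBREVIATIONS = {"incorporated": "inc", "corporation": "corp", "limited": "ltd"}

def name_key(name):
    """Reduce a company name to a key that ignores case, punctuation and word order."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return " ".join(sorted(words))

def fill_ids(rows):
    """rows is a list of (name, id or None); missing IDs are filled from rows with the same key."""
    known = {}
    for name, id_ in rows:
        if id_ is not None:
            known.setdefault(name_key(name), id_)
    return [(name, id_ if id_ is not None else known.get(name_key(name)))
            for name, id_ in rows]

rows = [
    ("T. Rowe Price Group", None),
    ("Group, T. Rowe Price", "576"),
    ("T. ROWE PRICE GROUP", None),
    ("Transatlantic, Inc", "458"),
    ("Transatlantic, Incorporated", None),
    ("Transatlantic, Inc", "458"),
]
for name, id_ in fill_ids(rows):
    print(id_, name)
```

Names whose key matches no known ID are left as None, so they can be reviewed by hand instead of being assigned a guess.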

