How can I create an artificial key column for merging two datasets using difflib when the column of interest has missing cells?

Question

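Goal: merge row i of df2 with row N of df1 if the name in row i of df2 is a substring of, or exactly matches, the name in row N of df1, and the State and District columns of row N match the corresponding columns of row i in df2.

It was suggested that I use difflib to create an artificial key column to merge on.

This new column is named 'Name'. difflib.get_close_matches looks for similar strings in df2.

This works fine when every row in the 'CandidateName' column has a value, but I get IndexError: list index out of range when a cell is missing.

I tried to work around this by filling the empty cells with the string 'EMPTY', but the same error still occurs.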

# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')


# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

# Data Frame Samples

import pandas as pd

# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara  Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)

print df1

#        CandidateName         District   Party          State
#0  Theodorick A. Bland           9                       VA
#1  Aedanus Rutherford Burke      2                       SC
#2  Jason Lewis                   2                       MN
#3  Barbara Comstock             10       Democrat        VA
#4  Theodorick Bland              9                       VA
#5  Aedanus Burke                 2                       SC
#6  Jason Initial Lewis           2         Democrat      MN
#7  ''                            1         Whig          NH
#8  ''                            1         Whig          NH

Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)

print df2

#   Name                 District   Party      State
#0  Theodorick Bland        9                   VA
#1  Aedanus Burke           2                   SC
#2  Jason Lewis             2       Democrat    MN
#3  Barbara Comstock        10      Democrat    VA

import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])

Expected output:

print(df1)
#              CandidateName State  District     Party              Name
#0       Theodorick A. Bland    VA         9            Theodorick Bland
#1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
#2               Jason Lewis    MN         2                 Jason Lewis
#3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
#4          Theodorick Bland    VA         9            Theodorick Bland
#5             Aedanus Burke    SC         2               Aedanus Burke
#6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
#7                              NH         1      Whig    
#8                              NH         1      Whig

Actual result (error):

-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

IndexError: list index out of range

Answer 1

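difflib.get_close_matches returns a list. For the rows where CandidateName is empty, nothing in df2['Name'] is close enough, so the list is empty and indexing it with [0] raises the IndexError. Second, the key column has to hold strings rather than lists for the merge to work, which the code in this answer handles with ''.join().

Note: you do not need df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY'). The placeholder does not match any name in df2 either, so it does not prevent the error.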
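To make the cause concrete, here is a minimal check using only the names from df2 in the question. It is a sketch with a plain list standing in for df2['Name']; the variable names are illustrative and not part of the original code.

```python
import difflib

# Names copied from df2['Name'] in the question.
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# A real name has a close candidate, so the returned list is non-empty.
real = difflib.get_close_matches('Theodorick A. Bland', names)

# Neither the empty string nor the 'EMPTY' placeholder resembles any name,
# so both calls return an empty list.
blank = difflib.get_close_matches('', names)
placeholder = difflib.get_close_matches('EMPTY', names)

print(real[0])
print(blank, placeholder)

# Indexing an empty list is what raises the error seen in the traceback.
try:
    blank[0]
except IndexError as exc:
    print('IndexError:', exc)
```

Any input with no candidate above the default similarity cutoff (0.6) behaves the same way, so the lookup has to tolerate an empty result instead of assuming one exists.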

import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')

print(df_merge)
              CandidateName State  District     Party              Name
0       Theodorick A. Bland    VA         9            Theodorick Bland
1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
2               Jason Lewis    MN         2                 Jason Lewis
3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
4          Theodorick Bland    VA         9            Theodorick Bland
5             Aedanus Burke    SC         2               Aedanus Burke
6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
7                              NH         1      Whig                  
8                              NH         1      Whig

Note that I added the how='left' argument to merge, because you want to keep the shape of the original DataFrame.

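Explanation of ''.join(): it converts the list into a string, and an empty list becomes an empty string. The example below uses ' ' as the separator so the result is easier to read: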

lst = ['hello', 'world']

print(' '.join(lst))
# hello world
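One caveat with ''.join(): get_close_matches can return up to three candidates by default, and joining them glues the names together, producing a key that matches nothing. The sketch below shows the effect and one possible alternative that keeps only the best candidate. The helper name best_match and the extra candidate 'Jason Lewis Jr.' are hypothetical and not part of the question's data.

```python
import difflib

def best_match(name, candidates, default=''):
    """Return the closest candidate to `name`, or `default` if none is close enough."""
    # n=1 asks difflib for at most one candidate (the highest-scoring one).
    matches = difflib.get_close_matches(name, candidates, n=1)
    return matches[0] if matches else default

names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# With the question's data, every non-empty name resolves to a single key
# and empty cells fall back to the default.
key_for_name = best_match('Jason Initial Lewis', names)
key_for_blank = best_match('', names)

# Hypothetical extra candidate to show what ''.join() does with two matches.
ambiguous = ['Jason Lewis', 'Jason Lewis Jr.']
joined = ''.join(difflib.get_close_matches('Jason Lewis', ambiguous))
single = best_match('Jason Lewis', ambiguous)

print(repr(key_for_name), repr(key_for_blank))
print(repr(joined), repr(single))
```

Assuming the DataFrames from the question, this could be applied as df1['Name'] = df1['CandidateName'].apply(lambda x: best_match(x, df2['Name'])), followed by the same left merge.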

python

regex

difflib

python-2.7

pandas