当感兴趣的列缺少单元格时,如何使用 difflab 创建人工键列以合并两个数据集?
How can I create an artificial key column for merging two datasets using difflab when the column of interest has missing cells?
目标:如果第i行df2中的名字是df1中某个名字的子串或完全匹配df1 中的某些行 N 和第 N 行的州和地区列与 df2 第 i 行的相应州和地区列相匹配,合并。
有人建议我使用 difflib 创建一个人工键列以进行合并。
这个新专栏名为 'name'。 difflib.get_close_matches 在 df2 中寻找相似的字符串。
当 'CandidateName' 列中的所有行都存在时效果很好,但我得到 IndexError: list index out of range when a cell is missing.
我尝试通过在空列中填充字符串 'EMPTY' 来解决这个问题。但是还是出现同样的错误。
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print df1
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print df2
# CandidateName District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
预计
print(df1)
# CandidateName State District Party Name
#0 Theodorick A. Bland VA 9 Theodorick Bland
#1 Aedanus Rutherford Burke SC 2 Aedanus Burke
#2 Jason Lewis MN 2 Jason Lewis
#3 Barbara Comstock VA 10 Democrat Barbara Comstock
#4 Theodorick Bland VA 9 Theodorick Bland
#5 Aedanus Burke SC 2 Aedanus Burke
#6 Jason Initial Lewis MN 2 Democrat Jason Lewis
#7 NH 1 Whig
#8 NH 1 Whig
实际错误结果:
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
您将返回一个 list
类型的对象。这些列表没有索引 0
。这就是您收到此错误的原因。其次,我们需要将这些 lists
转换为类型 string
以便能够像下面这样进行合并:
注意:你不必使用:df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
7 NH 1 Whig
8 NH 1 Whig
注意 我在 merge
中添加了 how='left'
参数,因为您想保持原始数据框的形状。
''.join()
的解释
我们这样做是为了将列表转换为字符串,请参见示例:
lst = ['hello', 'world']
print(' '.join(lst))
'hello world'
目标:如果第i行df2中的名字是df1中某个名字的子串或完全匹配df1 中的某些行 N 和第 N 行的州和地区列与 df2 第 i 行的相应州和地区列相匹配,合并。
有人建议我使用 difflib 创建一个人工键列以进行合并。
这个新专栏名为 'name'。 difflib.get_close_matches 在 df2 中寻找相似的字符串。
当 'CandidateName' 列中的所有行都存在时效果很好,但我得到 IndexError: list index out of range when a cell is missing.
我尝试通过在空列中填充字符串 'EMPTY' 来解决这个问题。但是还是出现同样的错误。
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print df1
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print df2
# CandidateName District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
预计
print(df1)
# CandidateName State District Party Name
#0 Theodorick A. Bland VA 9 Theodorick Bland
#1 Aedanus Rutherford Burke SC 2 Aedanus Burke
#2 Jason Lewis MN 2 Jason Lewis
#3 Barbara Comstock VA 10 Democrat Barbara Comstock
#4 Theodorick Bland VA 9 Theodorick Bland
#5 Aedanus Burke SC 2 Aedanus Burke
#6 Jason Initial Lewis MN 2 Democrat Jason Lewis
#7 NH 1 Whig
#8 NH 1 Whig
实际错误结果:
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
您将返回一个 list
类型的对象。这些列表没有索引 0
。这就是您收到此错误的原因。其次,我们需要将这些 lists
转换为类型 string
以便能够像下面这样进行合并:
注意:你不必使用:df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
7 NH 1 Whig
8 NH 1 Whig
注意 我在 merge
中添加了 how='left'
参数,因为您想保持原始数据框的形状。
''.join()
的解释
我们这样做是为了将列表转换为字符串,请参见示例:
lst = ['hello', 'world']
print(' '.join(lst))
'hello world'