Fuzzy match columns and merge/join dataframes
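I am trying to merge two dataframes, each with multiple columns, based on matching values in one column from each. This code from @Erfan does a great job of fuzzy matching the target columns, but is there also a way to carry over the rest of the columns?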
Dataframes
import pandas as pd

df1 = pd.DataFrame({'Key': ['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel']})
df2 = pd.DataFrame({'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23': ['1', '2', '3', '4', '5', '6', '7']})
The matching function from @Erfan, as described in the link above
# Assumed import (not shown in the post): process.extract as provided by fuzzywuzzy
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()
    # For each left key, collect up to `limit` (match, score) pairs, best first
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep only matches scoring at least `threshold`, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1
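To make the roles of threshold and limit concrete, here is a minimal standard-library sketch of the same "take the top few, then filter by score" step. It is not @Erfan's code and it does not use the same scorer: difflib.SequenceMatcher stands in for process.extract, and the helper name top_matches is hypothetical, so the scores will differ from the ones the function above works with.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def top_matches(query, choices, threshold=90, limit=2):
    """Return up to `limit` (choice, score) pairs scoring at least `threshold`.

    Hypothetical helper for illustration only.
    """
    scored = [(choice, score(query, choice)) for choice in choices]
    # Sort best-first, keep the top `limit`, then drop anything below threshold.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [pair for pair in scored[:limit] if pair[1] >= threshold]


choices = ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry']
banana = top_matches('Banana', choices, threshold=80, limit=1)
orange = top_matches('Orange', choices, threshold=95, limit=1)
print(banana)
print(orange)
```

As in the function above, limit is applied before threshold, so a key whose best candidate falls below the threshold ends up with no match at all rather than falling back to a weaker one.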
Calling the function
df = fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=1)
df.sort_values(by='Key',ascending=True).reset_index()
Result
index  Key          matches
0      Apple Souce  Aple suce
1      Banana       Bannanna
2      John tabel
3      Orange
4      Strawberry   Straw
Desired result
index  Key          matches    Key23
0      Apple Souce  Aple suce  1
1      Banana       Bannanna   6
2      John tabel
3      Orange
4      Strawberry   Straw      5
For anyone who needs this, here is the solution I came up with.
merge = pd.merge(df, df2, left_on=['matches'],right_on=['Key'],how='outer').fillna(0)
From there you can drop the unnecessary or duplicate columns to get a clean result, like this:
clean = merge.drop(['matches', 'Key_y'], axis=1)
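If the goal is exactly the table under "Desired result" (one row per row of df1, with Key23 left blank where nothing matched), a left join may be a closer fit than the outer join, which also keeps unmatched df2 rows such as Mango. The sketch below is an alternative under that assumption, not the code from the answer. To stay self-contained it hard-codes the matches column from the "Result" table instead of calling fuzzy_merge, and the names right and wanted are illustrative.

```python
import pandas as pd

# Left table as it looks after fuzzy_merge, copied from the "Result" table.
df = pd.DataFrame({
    'Key': ['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel'],
    'matches': ['Aple suce', 'Bannanna', '', 'Straw', ''],
})
df2 = pd.DataFrame({
    'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
    'Key23': ['1', '2', '3', '4', '5', '6', '7'],
})

# Rename the right-hand key up front so no Key_x / Key_y suffixes appear.
right = df2.rename(columns={'Key': 'matches'})

# how='left' keeps every row of df and pulls Key23 only where a match exists.
wanted = (
    df.merge(right, on='matches', how='left')
      .fillna({'Key23': ''})
      .sort_values(by='Key')
      .reset_index(drop=True)
)
print(wanted)
```

Either version relies on limit=1: with a larger limit the matches column holds comma-joined strings, which will not equal any single key in df2.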