fuzzywuzzy ratio of two columns, only where another column is a 100 percent match
My data frames are df1 and df2. This is what I tried:
Matcher = df2['Account Name']
match = if df1['Billing Country'] == df2['Billing Country'] (process.extractOne(df1['Account Name'], Matcher))
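The code above does not work, but I only want to fuzzy-match the account names when the countries match.
Here is what I would suggest. First, do a full Cartesian join of the two dfs: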
df1.loc[:, 'MergeKey'] = 1 #create a mergekey
df2.loc[:, 'MergeKey'] = 1 #it is the same for both so that when you merge you get the cartesian product
#merge them to get the cartesian product (all possible combos)
merged = df1.merge(df2, on = 'MergeKey', suffixes = ['_1', '_2'])
Then compute the fuzzy ratio for each combination:
from fuzzywuzzy import fuzz

def fuzzratio(row):
    try:  # avoid errors, for example on NaNs
        return fuzz.ratio(row['Billing Country_1'], row['Billing Country_2'])
    except Exception:
        return 0  # you'll want to experiment without the try/except too

merged.loc[:, 'Ratio'] = merged.apply(fuzzratio, axis=1)  # create ratio column by applying function
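You should now have a df containing the ratio for every possible combination of df1['Billing Country'] and df2['Billing Country']. From there, just filter to keep the rows whose ratio is 100%:
result = merged[merged.Ratio == 100]  # fuzz.ratio scores run from 0 to 100, not 0 to 1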
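The three steps above (Cartesian join, scoring, filtering) can be sketched end to end on toy data. This is only a minimal sketch under assumptions: both small frames are made up for illustration, and `ratio` is a hypothetical difflib-based stand-in for fuzz.ratio so that the example needs nothing beyond pandas. Like fuzz.ratio, it scores on a 0 to 100 scale.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Toy data, invented for this sketch (not from the question).
df1 = pd.DataFrame({
    'Account Name': ['Acme Corp', 'Globex'],
    'Billing Country': ['France', 'Germany'],
})
df2 = pd.DataFrame({
    'Account Name': ['Acme Corporation', 'Globex GmbH', 'Initech'],
    'Billing Country': ['France', 'Germany', 'France'],
})

# Constant key on both sides, so the merge yields every combination of rows.
merged = df1.assign(MergeKey=1).merge(
    df2.assign(MergeKey=1), on='MergeKey', suffixes=('_1', '_2'))

# Score the country pair of each combination.
merged['Ratio'] = merged.apply(
    lambda row: ratio(row['Billing Country_1'], row['Billing Country_2']),
    axis=1)

# Keep only the combinations whose countries match exactly.
result = merged[merged['Ratio'] == 100]

print(len(merged))  # 2 * 3 = 6 combinations
print(result[['Account Name_1', 'Account Name_2']])
```

The rows left in `result` are the only ones where the account names still need to be compared, which can be done with the same kind of row-wise scoring.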
I worked it out in a slightly different way.
First I used merge:
merged_file = pd.merge(df2, df1, on='Billing Country', how = 'left')
Once I had all the possible matches, I applied fuzzywuzzy's process.extractOne:
`Reference_data= df2['Account Name']`
`Result = process.extractOne(df1, choices)`
The line above gave me the closest possible match for each value I was looking for.
Later I added one more line to calculate the ratio.
Result['ratio'] = Result.apply(lambda row: fuzz.ratio(row['Account Name_x'], row['Account Name_y']), axis=1)  # fuzz.ratio compares two strings, so score row by row
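For reference, the merge-first approach can be sketched as follows: join on the country so that names are only ever compared within the same country, score each pair, then keep the best-scoring candidate per account. This is a rough sketch under assumptions, not the exact code used above: the toy frames are invented, `ratio` is a hypothetical difflib-based stand-in for fuzz.ratio, and an inner join is used instead of `how='left'` so that no row has a missing name to score.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Toy data, invented for this sketch.
df1 = pd.DataFrame({
    'Account Name': ['Acme Corp', 'Globex'],
    'Billing Country': ['France', 'Germany'],
})
df2 = pd.DataFrame({
    'Account Name': ['Acme Corporation', 'Globex GmbH', 'Initech'],
    'Billing Country': ['France', 'Germany', 'France'],
})

# Exact match on country first: each df1 account is paired only with
# df2 accounts from the same country (columns become Account Name_x / _y).
pairs = df1.merge(df2, on='Billing Country', suffixes=('_x', '_y'))

# Score every candidate pair of names row by row.
pairs['ratio'] = [
    ratio(a, b)
    for a, b in zip(pairs['Account Name_x'], pairs['Account Name_y'])
]

# For each df1 account, keep the candidate with the highest score.
best = pairs.loc[pairs.groupby('Account Name_x')['ratio'].idxmax()]

print(best[['Account Name_x', 'Account Name_y', 'ratio']])
```

Joining on the country first keeps the number of scored pairs far below the full Cartesian product when there are many countries.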