Calculate highest score in fuzzy string matching

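I'm looking to find the highest accuracy percentage between the values of 2 columns by using fuzzy string matching.

I have 2 dataframes, and I'm trying to fuzzy match specific column values between them.

Say df1 has 5 rows and df2 has 4 rows. I want to take the value in each row of df1, match it against every row of df2, and find the highest accuracy. For example, once Row1 of df1 has been compared with all rows of df2, whichever df2 row gives the highest accuracy is taken as the output. The same applies to every row in df1.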

Input data:

Dataframe1

id_number  company_name        match_acc

IN2231D    AXN pvt Ltd
UK654IN    Aviva Intl Ltd
SL1432H    Ship Incorporations
LK0678G    Oppo Mobiles pvt ltd
NG5678J    Nokia Inc

Dataframe2

identity_no   Pincode   company_name

 IN2231        110030    AXN pvt Ltd
 UK654IN       897653    Aviva Intl Ltd
 SL1432        07658     Ship Incorporations
 LK0678G       120988    Oppo Mobiles Pvt Ltd

I want to find the highest accuracy and put that value in the match_acc column.

The code I'm currently using:

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')


for index, row in df1.iterrows():
   df1['match_acc']= fuzz.partial_ratio(df1['id_number'], df2['identity_no'])

print(df1['match_acc'])

I have been using fuzzywuzzy; if there is another approach, please recommend it.

Any suggestions would be appreciated.

TL;DR

  1. Cross merge df1.id_number with df2.identity_no
  2. Compute thefuzz.fuzz.ratio (or the faster rapidfuzz.fuzz.ratio) for each pair
  3. Map the groupby.max ratios back to df1

cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43

Details

  1. merge with how='cross' produces the Cartesian product of df1.id_number and df2.identity_no:

    cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
    
    #    id_number identity_no
    # 0    IN2231D      IN2231
    # 1    IN2231D     UK654IN
    # 2    IN2231D      SL1432
    # ...
    # 17   NG5678J     UK654IN
    # 18   NG5678J      SL1432
    # 19   NG5678J     LK0678G
    

    For pandas < 1.2, how='cross' is not available, so use how='outer' on a temporary key instead:

    cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_no']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')
    
  2. apply the fuzzy scorer to each pair:

    cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
    
    #    id_number identity_no  match_acc
    # 0    IN2231D      IN2231         92
    # 1    IN2231D     UK654IN         29
    # 2    IN2231D      SL1432         15
    # ...
    # 17   NG5678J     UK654IN         14
    # 18   NG5678J      SL1432          0
    # 19   NG5678J     LK0678G         43
    
  3. groupby.max to get the max score per id_number, then map those scores into df1.match_acc:

    df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
    
    #   id_number          company_name  match_acc
    # 0   IN2231D           AXN pvt Ltd         92
    # 1   UK654IN        Aviva Intl Ltd        100
    # 2   SL1432H   Ship Incorporations         92
    # 3   LK0678G  Oppo Mobiles pvt ltd        100
    # 4   NG5678J             Nokia Inc         43
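
The three steps above can be tried without the Excel file. The following is a minimal sketch, not the answer's exact code: the two ID columns are typed in by hand from the question, and the scorer is a stand-in built on the standard library's difflib.SequenceMatcher (an assumption, used so that the example has no fuzzy-matching dependency). It is not thefuzz or rapidfuzz, and it may give different scores on other inputs.

```python
from difflib import SequenceMatcher

import pandas as pd

# Inline copies of the question's ID columns, so no Excel file is needed.
df1 = pd.DataFrame({"id_number": ["IN2231D", "UK654IN", "SL1432H", "LK0678G", "NG5678J"]})
df2 = pd.DataFrame({"identity_no": ["IN2231", "UK654IN", "SL1432", "LK0678G"]})


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib similarity scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# 1. Cartesian product of the two ID columns (needs pandas >= 1.2).
cross = df1[["id_number"]].merge(df2[["identity_no"]], how="cross")

# 2. Score every pair.
cross["match_acc"] = [ratio(a, b) for a, b in zip(cross["id_number"], cross["identity_no"])]

# 3. Best score per id_number, mapped back onto df1.
df1["match_acc"] = df1["id_number"].map(cross.groupby("id_number")["match_acc"].max())

print(df1)
```

To use a real fuzzy scorer, replace the ratio helper with fuzz.ratio from thefuzz or rapidfuzz; the merge, groupby and map steps stay the same.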
    

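You can use the functions in fuzzywuzzy's process module for one-to-many matching. Also, consider using rapidfuzz instead of fuzzywuzzy: it offers the same functionality, but it does some preprocessing based on string algorithms to return results faster.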

pip install rapidfuzz

import pandas as pd
# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # same interface, much faster execution

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')

# Build the list of choices once instead of on every iteration
choices = df2['identity_no'].tolist()

for index, row in df1.iterrows():
    # extractOne returns the best match from the choices as a tuple whose
    # second element is the score; the scorer argument selects the scorer
    best = process.extractOne(query=row['id_number'], choices=choices, scorer=fuzz.partial_ratio)
    # Store only the score, and only in the current row
    df1.loc[index, 'match_acc'] = best[1]
print(df1['match_acc'])
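
To make the one-to-many step concrete, here is a rough pure-Python sketch of what "extract the best one" amounts to. The best_match helper is hypothetical and is not how rapidfuzz or fuzzywuzzy implement extractOne. The scorer is again a difflib-based stand-in (an assumption) that computes a plain ratio rather than partial_ratio, and the IDs are typed in from the question's data.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in scorer (assumption): difflib similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(query, choices, scorer=ratio):
    """Hypothetical helper: return (choice, score) for the highest-scoring choice."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda pair: pair[1])


ids = ["IN2231D", "UK654IN", "NG5678J"]
choices = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

# Keep only the score per query, which is what the match_acc column needs.
scores = {q: round(best_match(q, choices)[1]) for q in ids}
print(scores)
```

Because the helper returns a (choice, score) pair, the matched identity_no can be kept as well as the score if it is useful to see which row was matched.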