Using rapidfuzz on a dataframe
I have 4 columns: BusinessID, NAME, BusinessID_y, and NAME_y. I want to match NAME against NAME_y with a similarity score of 90%, and drop the rows that score below 90%. Sample input:
df
BusinessID NAME BusinessID_y NAME_y
1013120869 MANOJ WANKHADE 1013404164 SLIMI
1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR
I'm new to Python and don't know how to do this. Also, I have 500k records, so any other fast fuzzy-matching approach would be great.
>>> import pandas as pd
>>> from rapidfuzz import fuzz
>>> # Score each row's NAME against NAME_y; fuzz.ratio returns a value from 0 to 100
>>> df['matching_ratio'] = df.apply(lambda x: fuzz.ratio(x.NAME, x.NAME_y), axis=1)
>>> df
BusinessID NAME BusinessID_y NAME_y matching_ratio
0 1013120869 MANOJ WANKHADE 1013404164 SLIMI 10.526316
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
2 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
3 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677
>>> df[df.matching_ratio > 26]  # 26 is only for demonstration; change it to 90 for your requirement
BusinessID NAME BusinessID_y NAME_y matching_ratio
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677