Calculate highest score in fuzzy string matching
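I want to find the highest match percentage between the values of two columns using fuzzy string matching.
I have two dataframes and am trying to fuzzy-match the values of one specific column in each of them.
Say df1 has 5 rows and df2 has 4 rows. I want to take the value in each row of df1, compare it against every row of df2, and keep the highest score. For example, once Row1 of df1 has been compared with all rows of df2, whichever df2 row scores highest is the output for Row1. The same applies to every other row of df1.
Input data: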
Dataframe1
id_number  company_name          match_acc
IN2231D    AXN pvt Ltd
UK654IN    Aviva Intl Ltd
SL1432H    Ship Incorporations
LK0678G    Oppo Mobiles pvt ltd
NG5678J    Nokia Inc
Dataframe2
identity_no  Pincode  company_name
IN2231       110030   AXN pvt Ltd
UK654IN      897653   Aviva Intl Ltd
SL1432       07658    Ship Incorporations
LK0678G      120988   Oppo Mobiles Pvt Ltd
I want to find the highest score and put that value in the match_acc column.
The code I am currently using:
df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')
from fuzzywuzzy import fuzz
for index, row in df1.iterrows():
    df1['match_acc']= fuzz.partial_ratio(df1['id_number'], df2['identity_no'])
print(df1['match_acc'])
I have been using fuzzywuzzy; if there are other approaches, please recommend them.
Any suggestions are appreciated.
TL;DR
- Cross merge df1.id_number with df2.identity_no
- Compute thefuzz.fuzz.ratio (or the faster rapidfuzz.fuzz.ratio) for each pair
- map the groupby.max ratios back to df1
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
# id_number company_name match_acc
# 0 IN2231D AXN pvt Ltd 92
# 1 UK654IN Aviva Intl Ltd 100
# 2 SL1432H Ship Incorporations 92
# 3 LK0678G Oppo Mobiles pvt ltd 100
# 4 NG5678J Nokia Inc 43
Details
merge with how='cross' produces the Cartesian product of df1.id_number and df2.identity_no:
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
# id_number identity_no
# 0 IN2231D IN2231
# 1 IN2231D UK654IN
# 2 IN2231D SL1432
# ...
# 17 NG5678J UK654IN
# 18 NG5678J SL1432
# 19 NG5678J LK0678G
For pandas < 1.2, how='cross' is not available, so use how='outer' on a temporary key:
cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_no']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')
apply the pairwise fuzz.ratio computation to each row:
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
# id_number identity_no match_acc
# 0 IN2231D IN2231 92
# 1 IN2231D UK654IN 29
# 2 IN2231D SL1432 15
# ...
# 17 NG5678J UK654IN 14
# 18 NG5678J SL1432 0
# 19 NG5678J LK0678G 43
Use groupby.max to get the max score per id_number and map those scores into df1.match_acc:
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
# id_number company_name match_acc
# 0 IN2231D AXN pvt Ltd 92
# 1 UK654IN Aviva Intl Ltd 100
# 2 SL1432H Ship Incorporations 92
# 3 LK0678G Oppo Mobiles pvt ltd 100
# 4 NG5678J Nokia Inc 43
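The three steps can also be traced without pandas. The following is a minimal sketch that uses only the standard library: simple_ratio is a hypothetical difflib-based stand-in for fuzz.ratio (it is not the thefuzz or rapidfuzz scorer, and it may score other strings differently), and the two lists are the sample ids from the question.

```python
from difflib import SequenceMatcher
from itertools import product


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (difflib stand-in, not the library scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


id_numbers = ["IN2231D", "UK654IN", "SL1432H", "LK0678G", "NG5678J"]
identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

# Step 1: Cartesian product of the two id columns (5 x 4 = 20 pairs).
pairs = list(product(id_numbers, identity_nos))

# Step 2: score every pair.
scored = [(a, b, simple_ratio(a, b)) for a, b in pairs]

# Step 3: keep the highest score per id_number.
best = {}
for a, _, score in scored:
    best[a] = max(best.get(a, 0), score)

print(best)
# {'IN2231D': 92, 'UK654IN': 100, 'SL1432H': 92, 'LK0678G': 100, 'NG5678J': 43}
```

On this sample the stand-in gives the same maxima as the match_acc column above. For real data the library scorers are the better choice, since the pure-Python loop is only meant to make the logic visible.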
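You can use fuzzywuzzy's process functions for one-to-many matching.
Also, consider using rapidfuzz instead of fuzzywuzzy. It provides the same functions, but it does some preprocessing, depending on the string algorithm, to return results faster.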
pip install rapidfuzz
import pandas as pd
# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # same function names, typically much faster

df1 = pd.read_excel('input.xlsx', sheet_name='sheet1')
df2 = pd.read_excel('input.xlsx', sheet_name='sheet2')
choices = df2['identity_no'].tolist()
for index, row in df1.iterrows():
    # extractOne picks the best match from the list of choices and returns it
    # as a tuple whose second element is the score; the scorer is configurable
    best = process.extractOne(query=row['id_number'], choices=choices, scorer=fuzz.partial_ratio)
    # write the score into this row only, not into the whole column
    df1.loc[index, 'match_acc'] = best[1]
print(df1['match_acc'])
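process.extractOne returns a tuple rather than a bare number, and the chosen scorer changes the result. The sketch below illustrates both points with the standard library only. extract_one, simple_ratio and simple_partial_ratio are hypothetical stand-ins written for this example: they are not the fuzzywuzzy or rapidfuzz implementations, and the sliding-window partial ratio is a simplification of what those libraries do.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # choice made for this sketch: an empty string matches nothing
    windows = (longer[i:i + len(shorter)]
               for i in range(len(longer) - len(shorter) + 1))
    return max(simple_ratio(shorter, w) for w in windows)


def extract_one(query, choices, scorer=simple_ratio):
    """Return (best_choice, score); on ties the first choice wins."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda pair: pair[1])


identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

match, score = extract_one("IN2231D", identity_nos)
print(match, score)  # IN2231 92

match, score = extract_one("IN2231D", identity_nos, scorer=simple_partial_ratio)
print(match, score)  # IN2231 100
```

With the partial scorer, "IN2231" is found in full inside "IN2231D", so the pair scores 100 instead of 92. Whether a trailing character should count against the match is the main thing to decide when choosing between ratio and partial_ratio.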