FuzzyWuzzy in Python: matching the same two phrases in different scripts gives different results
Here is the problem. I wrote a script that uses fuzzywuzzy to compare values in a DataFrame:
def check_match_principal_name(state):
    for i in range(len(ALL_SCHOOLS['Principal Name'])):
        for a in range(len(TOP100['Principal'])):
            matchADD = fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a])
            if matchADD > 90:
                print(ALL_SCHOOLS['Principal Name'][i]+' '+TOP100['Principal'][a])
                matchPRI.append(i)
                matchPRI100.append(a)
                print(ALL_SCHOOLS['Principal Name'][i])
                print(TOP100['Principal'][a])
    for i in matchPRI:
        ALL_SCHOOLS.loc[i, 'MatchPRI'] = 1
    for i in matchPRI100:
        TOP100.loc[i, 'MatchPRI'] = 1
    ALL_SCHOOLS.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/{state}1.xlsx')
    TOP100.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/top-100/{state}1.xlsx')
    matchPRI.clear()
    matchPRI100.clear()
It works and I don't get any exceptions, but in the script above, for example, fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a]) returns
Kimberly Beukema - Ms. Kimberly Beukema = 91
In a second script it looks like this:
from fuzzywuzzy import fuzz
match= fuzz.partial_token_sort_ratio('Kimberly Beukema',' Ms. Kimberly Beukema')
print(match)
It returns match = 100.
I don't understand why the value changes.
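token_sort_ratio and partial_token_sort_ratio both preprocess the two strings by default. This means they lowercase the strings, remove non-alphanumeric characters and trim whitespace. So in your case they convert: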
'Kimberly Beukema'
' Ms. Kimberly Beukema'
to
'kimberly beukema'
'ms kimberly beukema'
In the next step, both of them sort the words in the two strings:
'beukema kimberly'
'beukema kimberly ms'
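The two steps above can be approximated with the standard library alone. This is only a sketch, not fuzzywuzzy's actual code: the helper name process_and_sort is made up for illustration, and the regular expression is an ASCII-only approximation of the library's default processor.

```python
import re

def process_and_sort(s):
    """Hypothetical helper: lowercase, drop non-alphanumerics, sort the words."""
    # Lowercase, then turn every run of non-alphanumeric characters into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    # Split into words, sort them alphabetically and join them back together.
    return " ".join(sorted(cleaned.split()))

print(process_and_sort("Kimberly Beukema"))       # beukema kimberly
print(process_and_sort(" Ms. Kimberly Beukema"))  # beukema kimberly ms
```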
After that they compare the two strings. For this comparison, token_sort_ratio uses ratio, while partial_token_sort_ratio uses partial_ratio.
With ratio, 3 deletions are needed to convert 'beukema kimberly ms' into 'beukema kimberly'. Since the combined length of the two strings is 35 (16 + 19), the resulting ratio is round(100 * (1 - 3 / 35)) = 91.
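The arithmetic for ratio can be checked with difflib.SequenceMatcher from the standard library. This is a sketch under the assumption that the score is the usual 2 * matched / total similarity, which equals 1 - 3 / 35 for these two strings; the variable names are illustrative.

```python
from difflib import SequenceMatcher

a = "beukema kimberly"
b = "beukema kimberly ms"

matcher = SequenceMatcher(None, a, b)
# Number of characters that the two strings have in common, in order.
matched = sum(block.size for block in matcher.get_matching_blocks())
# Combined length of both strings.
total = len(a) + len(b)
# Similarity as a percentage: 2 * matched / total.
score = round(100 * 2 * matched / total)

print(matched, total, score)  # 16 35 91
```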
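With partial_ratio, the ratio is computed for the optimal alignment of the two strings. In your example 'beukema kimberly' is a substring of 'beukema kimberly ms', so the ratio is computed between 'beukema kimberly' and 'beukema kimberly', which gives round(100 * (1 - 0 / 32)) = 100.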
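The idea behind partial_ratio can be illustrated by sliding the shorter string across the longer one and keeping the best score. This is a simplified sketch, not the library's implementation (which picks candidate alignments differently), and the function name partial_ratio_sketch is made up for illustration.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Hypothetical helper: best ratio of the shorter string against any
    equally long window of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0  # Assumption: treat an empty input as no match.
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)

print(partial_ratio_sketch("beukema kimberly", "beukema kimberly ms"))  # 100
```

So the two scripts disagree because they call different scorers. Calling the same function in both places (either token_sort_ratio or partial_token_sort_ratio) should give the same value for the same pair of strings.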