FuzzyWuzzy in Python: matching the same two phrases in different scripts gives different results

So here is the problem. I wrote a script that uses fuzzywuzzy to compare values in a DataFrame:
def check_match_principal_name(state):
    for i in range(len(ALL_SCHOOLS['Principal Name'])):
        for a in range(len(TOP100['Principal'])):
            matchADD = fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a])
            if matchADD > 90:
                print(ALL_SCHOOLS['Principal Name'][i]+' '+TOP100['Principal'][a])
                matchPRI.append(i)
                matchPRI100.append(a)
                print(ALL_SCHOOLS['Principal Name'][i])
                print(TOP100['Principal'][a])
    for i in matchPRI:
        ALL_SCHOOLS.loc[i, 'MatchPRI'] = 1

    for i in matchPRI100:
        TOP100.loc[i, 'MatchPRI'] = 1

    ALL_SCHOOLS.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/{state}1.xlsx')
    TOP100.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/top-100/{state}1.xlsx')
    matchPRI.clear()
    matchPRI100.clear()

It works and I don't get any exceptions, but in the script above, for example, fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a]) returns 91 for 'Kimberly Beukema' and 'Ms. Kimberly Beukema'.

In the second script it looks like this:

from fuzzywuzzy import fuzz
match= fuzz.partial_token_sort_ratio('Kimberly Beukema','  Ms. Kimberly Beukema')
print(match)

It returns match = 100.

I don't understand why the value changes.

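Both token_sort_ratio and partial_token_sort_ratio preprocess the two strings by default. This means they lowercase the strings, remove non-alphanumeric characters and trim whitespace. So in your case it converts: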

'Kimberly Beukema'
'  Ms. Kimberly Beukema'

to

'kimberly beukema'
'ms kimberly beukema'

In the next step, both functions sort the words within each of the two strings:

'beukema kimberly'
'beukema kimberly ms'
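The preprocessing and sorting steps can be sketched with the standard library alone. This is only an approximation of what fuzzywuzzy does internally, not its actual code: the helper names preprocess and sort_tokens are made up for illustration, and the regular expression is an assumption about how non-alphanumeric characters are handled.

```python
import re


def preprocess(s: str) -> str:
    # Approximate the default processing: turn runs of non-word
    # characters into a single space, lowercase, and trim whitespace.
    return re.sub(r"\W+", " ", s).lower().strip()


def sort_tokens(s: str) -> str:
    # Split on whitespace, sort the words alphabetically, and rejoin.
    return " ".join(sorted(preprocess(s).split()))


print(sort_tokens("Kimberly Beukema"))        # beukema kimberly
print(sort_tokens("  Ms. Kimberly Beukema"))  # beukema kimberly ms
```

After this step the title "ms" ends up at the end of the second string, which is why the first string becomes a prefix of the second.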

After that, they compare the two strings. For this comparison, token_sort_ratio uses ratio, while partial_token_sort_ratio uses partial_ratio.

With ratio, 3 deletions are needed to turn 'beukema kimberly ms' into 'beukema kimberly'. Since the two strings have a combined length of 35 (16 + 19), the resulting ratio is round(100 * (1 - 3 / 35)) = 91.
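The arithmetic can be checked with difflib from the standard library. This is a sketch, assuming SequenceMatcher as a stand-in for the ratio used by the library; simple_ratio is a hypothetical helper name. For this pair of strings, SequenceMatcher's 2 * matches / total_length equals the formula above, because the 16 matching characters leave exactly 3 unmatched ones.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    # SequenceMatcher.ratio() is 2 * M / T, where M is the number of
    # matching characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


a = "beukema kimberly"
b = "beukema kimberly ms"

print(len(a) + len(b))              # 35
print(round(100 * (1 - 3 / 35)))    # 91
print(simple_ratio(a, b))           # 91
```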

partial_ratio computes the ratio for the optimal alignment of the two strings. In your case 'beukema kimberly' is a substring of 'beukema kimberly ms', so the ratio is calculated between 'beukema kimberly' and 'beukema kimberly', which gives round(100 * (1 - 0 / 32)) = 100.
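The idea of an optimal alignment can be illustrated with a brute-force sketch: slide the shorter string across the longer one and keep the best score. This is a simplified assumption about the technique, not the library's actual implementation, and simple_partial_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    # Compare the shorter string against every window of the same
    # length in the longer string and keep the highest ratio.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    n = len(shorter)
    best = 0.0
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# The first window of the longer string is identical to the shorter one.
print(simple_partial_ratio("beukema kimberly", "beukema kimberly ms"))  # 100
```

So the difference between 91 and 100 comes from the two scripts calling different functions (token_sort_ratio versus partial_token_sort_ratio), not from the data.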