Multiple Spelling Results in a Dataframe 1

Question

I have some data that contains spelling errors. I am correcting them and scoring how close each spelling is with the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

This gives the result:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000

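For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" (0.444444) scores higher than "potato" (0.400000). That is the wrong way round for my application.

How can I make the higher-scoring results always appear further to the left?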

Edit 1: I tried a simpler piece of code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

and got the same result:

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried a simpler piece of scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

and again got the same result:

 R1: 0.4
 R2: 0.444

Edit 2: I tried it with fuzzywuzzy. I got the same result again, because fuzzywuzzy relies on difflib:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")

Answer 1

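SequenceMatcher computes the ratio correctly, using the method described by Ratcliff and Metzener (1988). That is, given the number of characters the two strings have in common (CC) and the total number of characters in both strings (CT):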

ratio = 2 * CC / CT
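As a small check of this formula, the sketch below recomputes the ratio by hand from the matching blocks that SequenceMatcher reports and compares it with ratio(). The helper name manual_ratio is made up for illustration; get_matching_blocks() and ratio() are the standard difflib calls.

```python
import difflib

def manual_ratio(a, b):
    """Recompute 2 * CC / CT from the matching blocks (illustrative helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # CC: total number of characters covered by the matching blocks
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: total number of characters in both strings
    ct = len(a) + len(b)
    # Two empty strings are treated as identical, as ratio() does
    return 2 * cc / ct if ct else 1.0

for candidate in ["potato", "apple"]:
    by_hand = manual_ratio("pea7", candidate)
    built_in = difflib.SequenceMatcher(None, "pea7", candidate).ratio()
    print(candidate, round(by_hand, 3), round(built_in, 3))
```

For "pea7" this should reproduce the 0.4 and 0.444 reported in the question, since each pair shares two characters out of 10 and 9 respectively.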

So the problem appears to lie in get_close_matches.
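One way to keep the higher score on the left is to skip get_close_matches and sort the candidates by the same score that is displayed. As far as I can tell from the CPython source, get_close_matches ranks with the candidate as the first sequence and the query as the second, which is the opposite of SequenceMatcher(None, ask, x) in the question, and ratio() can change when its arguments are swapped. The following is a minimal sketch under that assumption; the name ranked_matches is hypothetical, and the n and cutoff defaults simply mirror the question.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def ranked_matches(ask, choices, n=5, cutoff=0.1):
    """Hypothetical helper: rank candidates by the score that is reported."""
    # Score every candidate with the query first, as in the question's Spelling()
    scored = [(x, difflib.SequenceMatcher(None, ask, x).ratio()) for x in choices]
    # Drop weak matches, then sort best-first (ties keep the order of `choices`)
    scored = [pair for pair in scored if pair[1] >= cutoff]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

for word, score in ranked_matches("pea7", Li_A):
    print(word, round(score, 3))

# Swapping the arguments can change the ratio, which is how the two orders diverge
print(difflib.SequenceMatcher(None, "pea7", "apple").ratio())
print(difflib.SequenceMatcher(None, "apple", "pea7").ratio())
```

To plug this into the question's code, Spelling could return pd.Series([w for w, _ in ranked] + [s for _, s in ranked]), where ranked = ranked_matches(ask, Li_A), which keeps the Spelling_n / Score_n column layout.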


python

spelling

difflib

dataframe

fuzzywuzzy