Finding similarity score between two columns using pandas

I have a dataframe like the one shown below:

ID,Region,Supplier,year,output
1,Test,Test1,2021,1
2,dummy,tUMMY,2022,1
3,dasho,MASHO,2022,1
4,dahp,ZYZE,2021,0
5,delphi,POQE,2021,1
6,kilby,Daasan,2021,1
7,sarby,abbas,2021,1

import pandas as pd

df = pd.read_clipboard(sep=',')

My objective is to:

a) Compare the two column values and assign a similarity score.

So, I tried the following:

import difflib
[(len(difflib.get_close_matches(x, df['Region'], cutoff=0.6))>1)*1
 for x in df['Supplier']]

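However, this produces '0' for every row, which suggests that every comparison falls below the cutoff of 0.6.

However, I expect my output to be as shown below.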

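Convert each column to lowercase and compare with >= instead of > (in this example there is at most one match per value, so len(...) > 1 is never true) to get the desired output: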

from difflib import SequenceMatcher, get_close_matches

# Lowercase both columns so the comparison is case-insensitive
regions = df['Region'].str.lower().tolist()
# n=1 keeps exactly one best match per supplier ('' when nothing reaches the default cutoff of 0.6)
df['best_match'] = [(get_close_matches(s, regions, n=1) or [''])[0] for s in df['Supplier'].str.lower()]
df['similarity_score'] = df.apply(lambda x: SequenceMatcher(None, x['Supplier'].lower(), x['best_match']).ratio(), axis=1)
df = df.assign(similarity_flag=df['similarity_score'].ge(0.6).astype(int)).drop(columns=['best_match'])

Output:

   ID  Region Supplier  year  output  similarity_score  similarity_flag
0   1    Test    Test1  2021       1          0.888889                1
1   2   dummy    tUMMY  2022       1          0.800000                1
2   3   dasho    MASHO  2022       1          0.800000                1
3   4    dahp     ZYZE  2021       0          0.000000                0
4   5  delphi     POQE  2021       1          0.000000                0
5   6   kilby   Daasan  2021       1          0.000000                0
6   7   sarby    abbas  2021       1          0.000000                0
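
To see why the attempt in the question returned 0 everywhere, here is a minimal sketch that reproduces it on plain lists (the variable names `regions`, `suppliers`, `original` and `fixed` are illustrative, not from the question). It assumes the two causes named above: `get_close_matches` compares case-sensitively, and `> 1` demands at least two matches.

```python
from difflib import get_close_matches

regions = ["Test", "dummy", "dasho", "dahp", "delphi", "kilby", "sarby"]
suppliers = ["Test1", "tUMMY", "MASHO", "ZYZE", "POQE", "Daasan", "abbas"]

# Attempt from the question: case-sensitive, and requires more than one match
original = [int(len(get_close_matches(s, regions, cutoff=0.6)) > 1) for s in suppliers]

# Lowercase both sides and accept a single match (>= 1)
regions_lower = [r.lower() for r in regions]
fixed = [
    int(len(get_close_matches(s.lower(), regions_lower, cutoff=0.6)) >= 1)
    for s in suppliers
]

print(original)  # [0, 0, 0, 0, 0, 0, 0]
print(fixed)     # [1, 1, 1, 0, 0, 0, 0]
```

Note that "Test1" does find one match ("Test") even in the case-sensitive version; it is the `> 1` test that turns it into 0.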

Updated answer with a similarity flag and score (using difflib.SequenceMatcher):

import difflib

cutoff = 0.6

# Score each row's Region against its own Supplier, ignoring case
df['similarity_score'] = (
    df[['Region','Supplier']]
    .apply(lambda x: difflib.SequenceMatcher(None, x['Region'].lower(), x['Supplier'].lower()).ratio(), axis=1)
)

df['similarity_flag'] = (df['similarity_score'] >= cutoff).astype(int)

Output:

   ID  Region Supplier  year  output  similarity_score  similarity_flag
0   1    Test    Test1  2021       1          0.888889                1
1   2   dummy    tUMMY  2022       1          0.800000                1
2   3   dasho    MASHO  2022       1          0.800000                1
3   4    dahp     ZYZE  2021       0          0.000000                0
4   5  delphi     POQE  2021       1          0.200000                0
5   6   kilby   Daasan  2021       1          0.000000                0
6   7   sarby    abbas  2021       1          0.200000                0
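
For readers wondering where a score such as 0.888889 comes from: `SequenceMatcher.ratio()` is documented as 2*M/T, where M is the number of matching characters and T is the combined length of both strings. The sketch below checks that by hand for the first row (the names `a`, `b`, `matched` and `manual` are illustrative).

```python
from difflib import SequenceMatcher

a, b = "test", "test1"  # first row, lowercased
matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks (the trailing sentinel block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is defined as 2*M/T, with T the total length of both strings
manual = 2 * matched / (len(a) + len(b))

print(matched)           # 4
print(round(manual, 6))  # 0.888889
```

Here 4 characters match out of 4 + 5 = 9 in total, so the score is 8/9.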

Try using apply with a lambda and axis=1:

# Flag is 1 when Supplier is a close match for Region (default cutoff 0.6), else 0
df['similarity_flag'] = (
    df[['Region','Supplier']]
    .apply(lambda x: len(difflib.get_close_matches(x['Region'].lower(), [x['Supplier'].lower()])), axis=1)
)

Output:

   ID  Region Supplier  year  output  similarity_flag
0   1    Test    Test1  2021       1                1
1   2   dummy    tUMMY  2022       1                1
2   3   dasho    MASHO  2022       1                1
3   4    dahp     ZYZE  2021       0                0
4   5  delphi     POQE  2021       1                0
5   6   kilby   Daasan  2021       1                0
6   7   sarby    abbas  2021       1                0
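
The flag above relies on the default cutoff of 0.6 in `get_close_matches`. If a different threshold is needed, one option is to pass `cutoff` explicitly. The following is only a sketch: `is_similar` is a hypothetical helper, and it works on plain lists so that it runs without pandas. The resulting list could be assigned to a dataframe column.

```python
from difflib import get_close_matches

def is_similar(a, b, cutoff=0.6):
    """Return 1 if b is a close match for a (case-insensitive), else 0."""
    return len(get_close_matches(a.lower(), [b.lower()], n=1, cutoff=cutoff))

regions = ["Test", "dummy", "dasho", "dahp", "delphi", "kilby", "sarby"]
suppliers = ["Test1", "tUMMY", "MASHO", "ZYZE", "POQE", "Daasan", "abbas"]

# Same row-wise pairing as apply(..., axis=1), written with zip
default_flags = [is_similar(r, s) for r, s in zip(regions, suppliers)]
strict_flags = [is_similar(r, s, cutoff=0.85) for r, s in zip(regions, suppliers)]

print(default_flags)  # [1, 1, 1, 0, 0, 0, 0]
print(strict_flags)   # [1, 0, 0, 0, 0, 0, 0]
```

With the stricter cutoff of 0.85, only the Test/Test1 pair (score 0.888889) is still flagged, since the other two pairs score 0.8.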