How can I calculate the Levenshtein distance between all rows in two dataframes and output the Levenshtein score for each pair?

I'm trying to calculate the Levenshtein distance between two dataframes (dfa and dfb), shown below.

dfa:

Name      Addresss     ID  
Name1a    Address1a    ID1a
Name2a    Address2a    ID2a

dfb:

Name      Addresss     ID  
Name1b    Address1b    ID1b
Name2b    Address2b    ID2b

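I understand how to calculate the distance between two strings, but I'm a bit confused about how to compare one whole column against another and get output like this, showing every pair and its score:

Output: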

Name      Name      LevScore
Name1a    Name1b       0.87
Name1a    Name2b       0.45
Name1a    Name3b       0.26
Name2a    Name1b       0.92
Name2a    Name2b       0.67
Name2a    Name3b       0.56
etc

Thanks in advance!

Manesh

Try this:

import pandas as pd
from textdistance import levenshtein
from itertools import product

# dfa = pd.read_clipboard()  # this is just to reproduce your dataframe

# dfb = pd.read_clipboard()  # this is just to reproduce your dataframe

dfc = pd.DataFrame(product(dfa['Name'], dfb['Name']), columns=['Name1', 'Name2'])

dfc['Distance'] = dfc.apply(lambda x: levenshtein.distance(x['Name1'],
                                                           x['Name2']), axis=1)

Output:

    Name1   Name2  Distance
0  Name1a  Name1b         1
1  Name1a  Name2b         2
2  Name2a  Name1b         2
3  Name2a  Name2b         1
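
Note that the output in the question shows scores between 0 and 1 rather than raw edit counts. The question doesn't say how that score is defined, so here is a minimal sketch that assumes it means 1 - distance / (length of the longer string). The helper names levenshtein_distance and similarity are made up for illustration, and the distance is written in plain Python so the example runs without any extra packages:

```python
from itertools import product


def levenshtein_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 1 - distance / longer length, so 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


names_a = ["Name1a", "Name2a"]  # stands in for dfa['Name']
names_b = ["Name1b", "Name2b"]  # stands in for dfb['Name']

# One (name_a, name_b, score) row per pair, same order as product() above
scores = [(x, y, round(similarity(x, y), 2)) for x, y in product(names_a, names_b)]
for row in scores:
    print(row)
```

If that is the score you are after, the same similarity function can be dropped into the dfc.apply call in place of levenshtein.distance.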

You can use the Levenshtein package together with itertools to get every combination of values from the two columns:

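import pandas as pd
import Levenshtein as lev
from itertools import product

# Cartesian product of the two Name columns: one row per (df1, df2) pair
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])

# lev.distance returns the raw edit distance between the two strings
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)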

print(new_df)

    Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1

Edit

Suppose this is your df1 (built here by repeating df1 three times):

df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n

    Name    Addresss    ID
0   Name1a  Address1a   ID1a
1   Name2a  Address2a   ID2a
2   Name1a  Address1a   ID1a
3   Name2a  Address2a   ID2a
4   Name1a  Address1a   ID1a
5   Name2a  Address2a   ID2a

As you suggested, you can take chunks of size step from df1_n and compute the combinations of values one chunk at a time:

chunks = []
step = 2
for i in range(0, df1_n.shape[0], step):
    # pair one chunk of df1_n names with every name in df2
    new_df = pd.DataFrame(product(df1_n['Name'].iloc[i:i+step], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
    chunks.append(new_df)

# stitch the per-chunk results together with a fresh 0..n-1 index
final_df = pd.concat(chunks, ignore_index=True)

print(final_df)

Output:

    Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1
4   Name1a  Name1b  1
5   Name1a  Name2b  2
6   Name2a  Name1b  2
7   Name2a  Name2b  1
8   Name1a  Name1b  1
9   Name1a  Name2b  2
10  Name2a  Name1b  2
11  Name2a  Name2b  1

Change 2 to 300 or 500 depending on your situation. This should keep you from filling up your entire RAM. Let me know if it works!
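
If even the combined result is too large to hold at once, the same chunking idea can be made lazy. This is only a sketch under some assumptions: scored_chunks and hamming_stand_in are hypothetical names, the plain lists stand in for the dataframe columns, and the stand-in scorer is just a placeholder so the example runs without extra packages (pass lev.distance as the scorer for real use).

```python
from itertools import islice, product


def hamming_stand_in(a: str, b: str) -> int:
    """Hypothetical placeholder scorer: differing positions plus length gap.
    Swap in lev.distance (or any other string metric) for real use."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def scored_chunks(left, right, scorer, chunk_size):
    """Lazily yield lists of up to chunk_size (left, right, score) rows."""
    pairs = product(left, right)  # product yields pairs one at a time
    while True:
        chunk = [(a, b, scorer(a, b)) for a, b in islice(pairs, chunk_size)]
        if not chunk:
            return
        yield chunk


names_a = ["Name1a", "Name2a"] * 3  # mirrors df1_n['Name']
names_b = ["Name1b", "Name2b"]      # mirrors df2['Name']

total = 0
for chunk in scored_chunks(names_a, names_b, hamming_stand_in, chunk_size=4):
    # each chunk could be appended to a CSV file here instead of kept in memory
    total += len(chunk)

print(total)
```

Only one chunk of scored pairs exists at a time, so memory use depends on chunk_size rather than on the total number of pairs.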