How can I calculate the Levenshtein distance between all rows in two dataframes and output the Levenshtein score for each pair?
I'm trying to calculate the Levenshtein distance between two dataframes (dfa and dfb), shown below.
dfa:
Name Addresss ID
Name1a Address1a ID1a
Name2a Address2a ID2a
dfb:
Name Addresss ID
Name1b Address1b ID1b
Name2b Address2b ID2b
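I understand how to calculate the distance between two strings, but I'm a bit confused about how to compare one column against another and produce output like the following, showing every pair and its score:
Output: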
Name Name LevScore
Name1a Name1b 0.87
Name1a Name2b 0.45
Name1a Name3b 0.26
Name2a Name1b 0.92
Name2a Name2b 0.67
Name2a Name3b 0.56
etc
Thanks in advance!
Manesh
Try this:
import pandas as pd
from textdistance import levenshtein
from itertools import product
# dfa = pd.read_clipboard() # this is just to reproduce your dataframe
# dfb = pd.read_clipboard() # this is just to reproduce your dataframe
dfc = pd.DataFrame(product(dfa['Name'], dfb['Name']), columns=['Name1', 'Name2'])
dfc['Distance'] = dfc.apply(lambda x: levenshtein.distance(x['Name1'], x['Name2']), axis=1)
Name1 Name2 Distance
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
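The expected output in the question shows scores between 0 and 1, while the code above returns raw edit distances. Here is a minimal sketch of one way to turn a distance into such a score, assuming the score is defined as 1 - distance / (length of the longer string); the question does not say which normalization it wants, so treat that formula as an assumption. The helper names `levenshtein_distance` and `similarity` are made up for illustration, and plain lists stand in for the `Name` columns so the snippet runs without any third-party package.

```python
from itertools import product


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest


names_a = ["Name1a", "Name2a"]  # stands in for dfa['Name']
names_b = ["Name1b", "Name2b"]  # stands in for dfb['Name']

# One row per pair: (Name1, Name2, raw distance, normalized score)
pairs = [(a, b, levenshtein_distance(a, b), round(similarity(a, b), 2))
         for a, b in product(names_a, names_b)]
for row in pairs:
    print(row)
```

The resulting list of tuples can be passed to pd.DataFrame with four column names if a dataframe is needed.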
You can use the Levenshtein package together with itertools to get every combination of values from the two columns:
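import pandas as pd
import Levenshtein as lev
from itertools import product
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
# lev.distance returns the integer edit distance between the two names
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)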
print(new_df)
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
Edit
Suppose this is your df1:
df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n
Name Addresss ID
0 Name1a Address1a ID1a
1 Name2a Address2a ID2a
2 Name1a Address1a ID1a
3 Name2a Address2a ID2a
4 Name1a Address1a ID1a
5 Name2a Address2a ID2a
As you said, you can take chunks of size step from df1_n and compute the combinations of values one chunk at a time:
final_df = pd.DataFrame()
step = 2
for i in range(0, df1_n.shape[0], step):
    # Pair the names in this chunk of df1_n with every name in df2
    new_df = pd.DataFrame(product(df1_n.iloc[i:i+step, 0], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
    final_df = pd.concat([final_df, new_df], axis=0).reset_index(drop=True)
print(final_df)
Output:
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
4 Name1a Name1b 1
5 Name1a Name2b 2
6 Name2a Name1b 2
7 Name2a Name2b 1
8 Name1a Name1b 1
9 Name1a Name2b 2
10 Name2a Name1b 2
11 Name2a Name2b 1
Depending on your case, change 2 to 300 or 500. This should keep you from filling up all of your RAM. Let me know if it works!
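One caveat: the loop above still appends every chunk to a single dataframe, so the final result grows with the full number of pairs. If only some pairs matter (for example, close matches), a generator can hand back one chunk at a time so that each chunk is filtered or written out before the next one is built. The following is a rough sketch of that idea, not a drop-in replacement; `edit_distance` and `scored_chunks` are hypothetical helper names, the threshold of 1 is arbitrary, and plain lists stand in for the dataframe columns.

```python
from itertools import islice, product


def edit_distance(a, b):
    """Levenshtein distance using a single rolling dynamic-programming row."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        diagonal, row[0] = row[0], i
        for j, ch_b in enumerate(b, start=1):
            # row[j] is still the value from the previous row at this point
            diagonal, row[j] = row[j], min(row[j] + 1,               # deletion
                                           row[j - 1] + 1,           # insertion
                                           diagonal + (ch_a != ch_b))  # substitution
    return row[-1]


def scored_chunks(names_a, names_b, step):
    """Yield one list of (a, b, distance) rows per chunk of `step` names from names_a."""
    names_b = list(names_b)  # reused for every chunk
    iterator = iter(names_a)
    while chunk := list(islice(iterator, step)):
        yield [(a, b, edit_distance(a, b)) for a, b in product(chunk, names_b)]


names_a = ["Name1a", "Name2a"] * 3  # stands in for df1_n['Name']
names_b = ["Name1b", "Name2b"]      # stands in for df2['Name']

close_matches = []
for rows in scored_chunks(names_a, names_b, step=2):
    # Keep only the rows of interest; the rest of the chunk is discarded.
    close_matches.extend(row for row in rows if row[2] <= 1)

print(len(close_matches))
```

Instead of filtering, each chunk could also be appended to a CSV file, which keeps memory use bounded by the chunk size rather than by the total number of pairs.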