String matching per row of two columns in a dataframe
Say I have a pandas dataframe that looks like this:
ID  String1                    String2
1   The big black wolf         The small wolf
2   Close the door on way out  door the Close
3   where's the money          where is the money
4   123 further out            out further
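I want to cross-tokenize each row of the String1 and String2 columns before doing fuzzy string matching, similar to the approach in a linked post.
My challenge is that the solution in the link I posted only works when String1 and String2 have the same number of words. Second, that solution looks at all rows in the columns, whereas I want mine to compare row by row only.
The proposed solution should do a matrix comparison for row 1, like this: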
string1    The  big  black  wolf  Maximum
string2
The        100    0      0     0      100
small        0    0      0     0        0
wolf         0    0      0   100      100
ID  String1                    String2             Matching_Average
1   The big black wolf         The small wolf      66.67
2   Close the door on way out  door the Close
3   where's the money          where is the money
4   123 further out            out further
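where the matching average is the sum of the 'Maximum' column divided by the number of words in String2. For row 1 that is (100 + 0 + 100) / 3 = 66.67.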
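You can first get dummy (indicator) columns from the two Series with str.get_dummies, then keep the String2 columns that also appear among the String1 columns, sum them row-wise, and divide by the row-wise sum of the String2 dummies: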
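# One indicator column per distinct word, splitting on spaces
a = df['String1'].str.get_dummies(' ')
b = df['String2'].str.get_dummies(' ')
# Keep the String2 word columns that occur anywhere in String1 (column-wide, not per row)
u = b[b.columns.intersection(a.columns)]
# Kept words per row divided by distinct String2 words per row, as a percentage
df['Matching_Average'] = u.sum(1).div(b.sum(1)).mul(100).round(2)
print(df)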
   ID                    String1             String2  Matching_Average
0   1         The big black wolf      The small wolf             66.67
1   2  Close the door on way out      door the Close            100.00
2   3          where's the money  where is the money             50.00
3   4            123 further out         out further            100.00
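As a sketch of the matrix the question describes, the two helpers below build the exact-match matrix for a single pair of strings and average the row maxima. The names `match_matrix` and `matching_average` are hypothetical, not part of the answer above. They use plain Python and compare strictly within one row, whereas `b.columns.intersection(a.columns)` works on the vocabulary of the whole column. They also divide by the number of words in String2 rather than the number of distinct words, so results may differ from `get_dummies` when a word repeats.

```python
def match_matrix(string1, string2):
    """Rows are the words of string2, columns the words of string1.

    A cell is 100 when the two words are identical, otherwise 0.
    """
    words1 = string1.split()
    words2 = string2.split()
    return [[100 if w2 == w1 else 0 for w1 in words1] for w2 in words2]


def matching_average(string1, string2):
    """Sum of the per-row maxima divided by the number of words in string2."""
    matrix = match_matrix(string1, string2)
    if not matrix:
        # string2 has no words, so there is nothing to average (assumed convention)
        return 0.0
    row_max = [max(row, default=0) for row in matrix]
    return round(sum(row_max) / len(row_max), 2)


print(match_matrix("The big black wolf", "The small wolf"))
# [[100, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 100]]
print(matching_average("The big black wolf", "The small wolf"))  # 66.67
```

Assuming the dataframe from the question, this could be applied row by row with `df['Matching_Average'] = [matching_average(x, y) for x, y in zip(df['String1'], df['String2'])]`.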
Otherwise, if you are fine with a string-matching algorithm, you can use difflib:
from difflib import SequenceMatcher
[SequenceMatcher(None,x,y).ratio() for x,y in zip(df['String1'],df['String2'])]
#[0.625, 0.2564102564102564, 0.9142857142857143, 0.6153846153846154]
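`ratio()` as used above compares the two whole strings character by character, which is why the reordered row 2 scores low. A word-level variant closer to the matrix in the question might look like the sketch below: each word of String2 is scored against every word of String1 with `SequenceMatcher`, the best score per String2 word is kept, and the scores are averaged. The function name `fuzzy_matching_average` and the choice to return 0.0 for empty input are assumptions for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_matching_average(string1, string2):
    """Average, over the words of string2, of the best similarity (0-100)
    against any word of string1."""
    words1 = string1.split()
    words2 = string2.split()
    if not words1 or not words2:
        # Nothing to compare (assumed convention, not from the original answer)
        return 0.0
    best = [
        max(SequenceMatcher(None, w1, w2).ratio() for w1 in words1)
        for w2 in words2
    ]
    return round(100 * sum(best) / len(best), 2)


pairs = [
    ("The big black wolf", "The small wolf"),
    ("Close the door on way out", "door the Close"),
    ("where's the money", "where is the money"),
    ("123 further out", "out further"),
]
for s1, s2 in pairs:
    print(s2, "->", fuzzy_matching_average(s1, s2))
```

Word order no longer matters with this approach, so rows whose String2 words all appear in String1 score 100, and near matches such as "where" against "where's" earn partial credit instead of 0.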