Comparing strings within two columns in pandas with SequenceMatcher
I am trying to determine the similarity of two columns in a pandas DataFrame:
Text1 | All
Performance results achieved by the approaches submitted to this Challenge. | The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist. | Where am I?
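I want to compare 'Performance results ...' with 'The six...', and 'Accuracy is one...' with 'Where am I?'.
The first row should show a higher similarity between the two columns because they share some words; the second should be 0 because the two columns have no words in common.
To compare the two columns I am using SequenceMatcher as follows: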
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, df.Text1, df.All).ratio()
But passing df.Text1, df.All like this seems to be wrong.
Can you tell me why?
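SequenceMatcher is not designed for pandas Series. It compares two sequences element by element, so each call needs a single pair of strings.
- You can .apply the function row by row.
- See the SequenceMatcher Examples in the difflib documentation.
- With isjunk=None, nothing is treated as junk, not even spaces.
- With isjunk=lambda y: y == " ", spaces are treated as junk.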
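To illustrate why passing whole columns does not give per-row scores, here is a minimal sketch that uses plain Python lists as stand-ins for the two Series (an assumption made so the snippet runs without pandas; the variable names are illustrative only). Given two lists of strings, SequenceMatcher treats each whole string as one element, so it only finds a match where an entire cell in one column equals an entire cell in the other.

```python
from difflib import SequenceMatcher

# Plain lists standing in for df.Text1 and df.All (illustrative assumption).
text1 = [
    'Performance results achieved by the approaches submitted to this Challenge.',
    'Accuracy is one of the basic principles of perfectionist.',
]
all_col = [
    'The six top approaches and three others outperform the strong baseline.',
    'Where am I?',
]

# Whole columns: every element is an entire string, and no string in
# text1 is equal to a string in all_col, so nothing matches.
column_ratio = SequenceMatcher(None, text1, all_col).ratio()

# One pair of strings at a time: characters are the elements being compared.
row_ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(text1, all_col)]

print(column_ratio)
print(row_ratios)
```

The single number from the first call says nothing about the individual rows, which is why the per-row .apply is needed.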
from difflib import SequenceMatcher
import pandas as pd
data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}
df = pd.DataFrame(data)
# isjunk=lambda y: y == " "
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)
# display(df)
Text1 All ratio
0 Performance results achieved by the approaches submitted to this Challenge. The six top approaches and three others outperform the strong baseline. 0.356164
1 Accuracy is one of the basic principles of perfectionist. Where am I? 0.088235
# isjunk=None
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)
# display(df)
Text1 All ratio
0 Performance results achieved by the approaches submitted to this Challenge. The six top approaches and three others outperform the strong baseline. 0.410959
1 Accuracy is one of the basic principles of perfectionist. Where am I? 0.117647
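Note that the character-level ratio for the second row is not 0, even though the two strings share no words, because they still share characters. If a word-based score is what is wanted, one possible variation (not part of the answer above) is to split each string into words and compare the token lists. The sketch below assumes lowercasing and whitespace splitting are acceptable; `word_ratio` is a hypothetical helper name, and punctuation stays attached to words.

```python
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity ratio computed over lowercase, whitespace-separated words.

    Hypothetical helper for illustration: punctuation is not stripped,
    so 'baseline.' and 'baseline' would count as different words.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


row1 = word_ratio(
    'Performance results achieved by the approaches submitted to this Challenge.',
    'The six top approaches and three others outperform the strong baseline.',
)
row2 = word_ratio(
    'Accuracy is one of the basic principles of perfectionist.',
    'Where am I?',
)

print(row1)  # greater than 0: 'the' and 'approaches' appear in both
print(row2)  # 0.0: no word in common
```

On the DataFrame this could be applied per row in the same way as above, for example df.apply(lambda x: word_ratio(x['Text1'], x['All']), axis=1).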