How to count how many sentences are similar?

Question

I have a dataset with two columns, one for the user and one for the text:

User      Text
49        there is a cat under the table
21        the sun is hot
431       could you please close the window?
65        there is a cat under the table
21        the sun is hot
53        there is a cat under the table

My expected output is:

Text                                   Freq         
there is a cat under the table          3
the sun is hot                          2
could you please close the window?      1

My approach is to use fuzz.partial_ratio to measure the match (similarity) between all sentences, and then use groupby to compute the frequencies.

I am using fuzz.partial_ratio, so an exact match returns 100:

check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)

where value is the threshold that determines whether two sentences count as matching/similar.

Answer 1

Try this:

df = df.groupby('Text').count()
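
Note that count() reports the number of non-null values in each remaining column, so the result above has a column named User rather than Freq. A minimal sketch of one way to get the Text/Freq layout from the question, assuming the sample data shown there (the variable name freq is only illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

freq = (
    df.groupby("Text")
      .size()                     # number of rows per distinct sentence
      .reset_index(name="Freq")   # back to a two-column DataFrame
      .sort_values("Freq", ascending=False, ignore_index=True)
)
print(freq)
```

This still groups exact duplicates only; sentences that differ by even one character end up in separate rows.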

Answer 2

The following should work:

import pandas as pd
from collections import Counter

# Count how often each exact sentence occurs
d = dict(Counter(df.Text))
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})

Answer 3

You can use value_counts():

df['Text'].value_counts()
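
Like the other answers, value_counts() only groups sentences that are exactly equal. If near-duplicates should be counted together, as the threshold in the question suggests, one possible approach is a greedy grouping: compare each sentence with the representatives seen so far and start a new group when none is similar enough. The sketch below is an assumption-laden illustration, not the asker's method: it uses the standard library's difflib.SequenceMatcher as a stand-in scorer, and the names similarity, count_similar and the default threshold of 90 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for a fuzzy scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def count_similar(texts, threshold=90):
    """Greedily assign each text to the first representative it matches."""
    counts = {}  # representative sentence -> number of texts assigned to it
    for text in texts:
        for rep in counts:
            if similarity(text, rep) >= threshold:
                counts[rep] += 1
                break
        else:
            # No existing representative was similar enough: start a new group
            counts[text] = 1
    return counts


texts = [
    "there is a cat under the table",
    "the sun is hot",
    "could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]
freq = count_similar(texts)
for text, n in freq.items():
    print(n, text)
```

With fuzzywuzzy installed, fuzz.ratio or fuzz.partial_ratio could replace similarity, since both also return a score from 0 to 100. The result depends on the order of the input and on the threshold, and every sentence is compared with every representative, so this can become slow on large datasets.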
