How to count how many sentences are similar?

I have a dataset with two columns, one for the user and one for the text:

User      Text
49        there is a cat under the table
21        the sun is hot
431       could you please close the window?
65        there is a cat under the table
21        the sun is hot
53        there is a cat under the table

My expected output is:

Text                                   Freq         
there is a cat under the table          3
the sun is hot                          2
could you please close the window?      1

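My approach is to use fuzz.partial_ratio to measure the match (similarity) between all sentences, and then use groupby to count the frequencies.

I am using fuzz.partial_ratio, so for an exact match it returns 100: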

check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)

Here value is the threshold used to decide whether two sentences count as a match (i.e., are similar).
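
To make the thresholded comparison concrete, here is a minimal sketch of counting sentences that are similar rather than strictly identical. It makes several assumptions: difflib.SequenceMatcher from the standard library stands in for fuzz.partial_ratio (so the scores will not be identical to fuzzywuzzy's), THRESHOLD, similarity and count_similar are hypothetical names, and each sentence is greedily assigned to the first earlier sentence it matches, which is only one of several possible grouping strategies.

```python
from collections import Counter
from difflib import SequenceMatcher

texts = [
    "there is a cat under the table",
    "the sun is hot",
    "could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]

THRESHOLD = 90  # hypothetical cutoff on a 0-100 scale, playing the role of `value`


def similarity(a, b):
    """Return a 0-100 score; a stand-in for fuzz.partial_ratio, not the same metric."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def count_similar(sentences, threshold):
    """Greedily assign each sentence to the first representative it matches."""
    counts = Counter()
    for sentence in sentences:
        for representative in counts:
            if similarity(sentence, representative) >= threshold:
                counts[representative] += 1
                break
        else:
            # No existing representative is close enough: start a new group.
            counts[sentence] = 1
    return counts


freq = count_similar(texts, THRESHOLD)
for text, n in freq.most_common():
    print(f"{text:<40}{n}")
```

Because every sentence is compared against every group representative, this is quadratic in the worst case; for data where only exact duplicates matter, plain counting is simpler and faster.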

Try this:

df = df.groupby('Text').count()

The following should work:

from collections import Counter

import pandas as pd

# Count exact occurrences of each sentence
d = dict(Counter(df.Text))
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})

You can use value_counts():

df['Text'].value_counts()
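
value_counts() returns a Series indexed by the sentence, so a little reshaping is needed to get the two-column table from the question. The sketch below rebuilds the sample data and names the count column Freq; it assumes that only exact duplicates should be counted together, and the variable name freq is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

# value_counts() sorts by frequency (descending) by default.
freq = (
    df["Text"]
    .value_counts()
    .rename_axis("Text")       # name the index so it becomes the Text column
    .reset_index(name="Freq")  # turn the counts into a Freq column
)
print(freq)
```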