How to count how many sentences are similar?
I have a dataset made up of 2 columns, one for the user and one for the text:
`User` `Text`
49 there is a cat under the table
21 the sun is hot
431 could you please close the window?
65 there is a cat under the table
21 the sun is hot
53 there is a cat under the table
My expected output is:
Text Freq
there is a cat under the table 3
the sun is hot 2
could you please close the window? 1
My approach is to use fuzz.partial_ratio to measure the match (similarity) between all pairs of sentences, and then use groupby to count the frequencies.
I am using fuzz.partial_ratio, so an exact match returns 100:
check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)
where value is the threshold used to decide whether two sentences count as matching/similar.
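For illustration, here is a minimal sketch of how the described approach could be wired up end to end. It assumes fuzzywuzzy's fuzz.partial_ratio, an arbitrarily chosen threshold of 90, and a simple strategy of grouping each sentence under the first earlier sentence that clears the threshold; none of these choices are prescribed by the question.

from fuzzywuzzy import fuzz
import pandas as pd

value = 90   # similarity threshold (assumed)
reps = []    # one representative sentence per group of similar texts

def assign_group(text):
    # reuse an existing representative if it is similar enough,
    # otherwise start a new group with this text
    for rep in reps:
        if fuzz.partial_ratio(text, rep) >= value:
            return rep
    reps.append(text)
    return text

df['Group'] = df['Text'].apply(assign_group)
freq = df.groupby('Group').size().reset_index(name='Freq')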
Try this:
df = df.groupby('Text').count()
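To get exactly the Text / Freq layout shown in the question, the grouped result can be reshaped; the size()/reset_index/sort_values chain below is one possible variant, not part of the original answer:

freq = df.groupby('Text').size().reset_index(name='Freq').sort_values('Freq', ascending=False)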
The following should work:
from collections import Counter
import pandas as pd

d = dict(Counter(df.Text))  # map each sentence to its count
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})
You can use value_counts():
df['Text'].value_counts()
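value_counts() returns a Series indexed by the sentences; if a DataFrame with Text and Freq columns like the expected output is needed, it can be converted, for example (the rename_axis/reset_index step is an addition, not part of the original answer):

df['Text'].value_counts().rename_axis('Text').reset_index(name='Freq')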