pandas: calculating average similarity across all categories
I have a dataframe like the one below, but larger:
import pandas as pd

data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the young boy is there', 'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
And I have computed the average similarity between every possible pair based on the first column (I got help with this part from other answers on Stack Overflow):
import nltk
from itertools import combinations

# function to calculate the similarity between each pair of documents
# (Jaccard similarity of the token sets, as a percentage)
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# tokenize all sentences of each category into one token list per category
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# print the similarity measure for each pair of categories
for val in combinations(range(len(data_similarity)), 2):
    print(f"similarity between {data_similarity.iloc[val[0], 0]} and {data_similarity.iloc[val[1], 0]} intents is: {similarity_measure(data_similarity.iloc[val[0], 1], data_similarity.iloc[val[1], 1])}")
The output I want is the average across all pairs. For example, if the code above produced the following output:
similarity between first value and second value is 60
similarity between first value and third value is 50
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55
then I would like to get the average score of the first value across all its pairings, of the second value across all its pairings, and of the third value across all its pairings, like this:
first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is 52.5
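For example, the first value appears in the pairs scored 60 and 50, so its average is (60 + 50) / 2 = 55; likewise (60 + 55) / 2 = 57.5 for the second value and (50 + 55) / 2 = 52.5 for the third.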
Edit: Based on your comment, you can do the following:
- First compute the data_similarity table, which combines the tokens from the different sentences into one token group per category.
- Compute tuples of pairwise similarities between the groups.
- Put them into a dataframe, then group by category and take the mean.
import nltk
import pandas as pd
from itertools import product

# function to calculate the similarity between each pair of documents
# (Jaccard similarity of the token sets, as a percentage)
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# tokenize all sentences of each category into one token list per category
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# build every ordered pair of distinct categories with its similarity score
all_pairs = [(i, l, similarity_measure(j, m)) for (i, j), (l, m) in
             product(zip(data_similarity['First'], data_similarity['Second']), repeat=2) if i != l]

pair_similarity = pd.DataFrame(all_pairs, columns=['A', 'B', 'Similarity'])
group_similarity = pair_similarity.groupby('A')['Similarity'].mean().reset_index()
print(group_similarity)
              A  Similarity
0   First value   47.777778
1  Second value   45.000000
2   Third value   52.777778
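As a sanity check, the numbers in this table can be reproduced by hand. The sketch below assumes NLTK's word_tokenize splits these simple sentences on whitespace, writes out the resulting token set of each category explicitly instead of deriving it from the pipeline, and reuses the similarity_measure function defined above.

# token sets per category: the two sentences of each category, deduplicated
first  = {'the', 'old', 'man', 'is', 'here', 'young', 'boy', 'there'}   # 8 tokens
second = {'the', 'old', 'woman', 'is', 'here', 'girl'}                  # 6 tokens
third  = {'the', 'young', 'girl', 'is', 'there', 'here'}                # 6 tokens

# First vs Second: the intersection {'the', 'old', 'is', 'here'} has 4 tokens
# and the union has 8 + 6 - 4 = 10, so the similarity is 4 / 10 * 100 = 40.0
print(similarity_measure(first, second))   # 40.0
print(similarity_measure(first, third))    # 55.55... (5 shared, 9 in the union)
print(similarity_measure(second, third))   # 50.0     (4 shared, 8 in the union)

'First value' then averages (40.0 + 55.56) / 2 ≈ 47.78, 'Second value' (40.0 + 50.0) / 2 = 45.0, and 'Third value' (55.56 + 50.0) / 2 ≈ 52.78, matching the table. Note that product(..., repeat=2) with i != l emits each unordered pair twice, once in each direction, so every category appears in column A once per partner; grouping by A therefore averages each category against all the others, which combinations alone would only give after adding the mirrored rows.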