pandas: calculate fuzzywuzzy for each category separately

Question

I have a dataset like the following, only with many more rows:

import pandas as pd

data = {'First':  ['First value','Third value','Second value','First value','Third value','Second value'],
'Second': ['the old man is here','the young girl is there', 'the old woman is here','the  young boy is there','the young girl is here','the old girl is here']}

df = pd.DataFrame (data, columns = ['First','Second'])

I computed the average fuzzywuzzy score over the whole dataset as follows:

from itertools import combinations  # needed for the pairwise loop below

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def similarity_measure(doc1, doc2): 
    return fuzz.token_set_ratio(doc1, doc2)


# Join the sentences of each category into one string per category
d = df.groupby('First')['Second'].apply(', '.join)
d = d.reset_index()
scores = []  # renamed from `all` to avoid shadowing the built-in
for i, j in combinations(range(len(d)), 2):
    scores.append(similarity_measure(d.iloc[i, 1], d.iloc[j, 1]))

avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)

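However, I would also like to get the average separately for each category in the First column. So I want something like this (for example):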

similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01

So I would like to modify the for loop to produce the output above.

Answer 1

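To compute the average within each group, you need two steps:

Group by some criterion, in your case the column First. It seems you already know how to do this.
Create a function that computes the similarity for a group; this is the all_similarity_measure function in the code below.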

Code

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations


def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)


data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the  young boy is there',
                   'the young girl is here', 'the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])


def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()


res = df.groupby('First', as_index=False)['Second'].apply(all_similarity_measure)
print(res)

Output

          First  Second
0   First value    63.0
1  Second value    86.0
2   Third value    98.0
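
To print the result in the one-line-per-category form the question asks for, one option is to loop over (label, score) pairs. This is a minimal sketch, assuming the averages are held in a plain dict named group_means (a hypothetical name) filled with the values from the output above; with the DataFrame res, iterating over res.itertuples(index=False) should give the same pairs.

```python
# Hypothetical container holding the per-group averages shown in the output above.
group_means = {"First value": 63.0, "Second value": 86.0, "Third value": 98.0}


def format_group_means(means):
    """Return one report line per category, with the score shown to two decimals."""
    return [
        f"similarity average for sentences in {label}: {score:.2f}"
        for label, score in means.items()
    ]


report_lines = format_group_means(group_means)
for line in report_lines:
    print(line)
```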

The key to computing the average similarity is this expression:

return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()

Basically, you generate the sentence pairs with combinations (no need to access by index), construct a Series from the pairwise scores, and compute its mean.
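
The pairwise step can also be looked at in isolation. The sketch below avoids pandas and uses difflib.SequenceMatcher from the standard library as a stand-in scorer. That is an assumption made only to keep the example dependency-free: it is not fuzz.token_set_ratio and yields different numbers. pairwise_scores is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def stand_in_similarity(doc1, doc2):
    """Stand-in scorer on a 0-100 scale; NOT fuzzywuzzy's token_set_ratio."""
    return 100 * SequenceMatcher(None, doc1, doc2).ratio()


def pairwise_scores(sentences, scorer=stand_in_similarity):
    """Score every unordered pair of sentences exactly once."""
    return [scorer(a, b) for a, b in combinations(sentences, 2)]


# Sentences of the 'Third value' group from the example data.
third_group = ["the young girl is there", "the young girl is here"]
# A hypothetical three-sentence group, to show how the pair count grows.
three_sentences = third_group + ["the old girl is here"]

scores = pairwise_scores(three_sentences)
print(len(scores))   # n sentences give n * (n - 1) / 2 pairs, here 3
print(mean(scores))  # average of the pairwise scores
```

Because combinations yields each unordered pair once, no index bookkeeping is needed and no sentence is compared with itself.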

Any function that computes a mean can be used in place of pd.Series(...).mean(); for example, you can use statistics.mean to avoid constructing a Series.

from statistics import mean

def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return mean(similarity_measure(*docs) for docs in combinations(gdf, 2))
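
One edge case worth noting: a category with a single sentence has no pairs, and statistics.mean raises StatisticsError on empty input. The following is a sketch of a guard, assuming NaN is the desired result for such groups (an assumption; the answer above does not address this case). safe_pairwise_mean and shared_word_percent are hypothetical helpers; the scorer is passed in as an argument, so fuzz.token_set_ratio could be supplied instead of the toy scorer.

```python
import math
from itertools import combinations
from statistics import mean


def safe_pairwise_mean(sentences, scorer):
    """Mean score over all unordered pairs; NaN when there are fewer than two sentences."""
    scores = [scorer(a, b) for a, b in combinations(sentences, 2)]
    # statistics.mean raises StatisticsError on an empty sequence, so guard first.
    return mean(scores) if scores else math.nan


def shared_word_percent(doc1, doc2):
    """Toy scorer for illustration only: percentage of doc1's distinct words found in doc2."""
    words1, words2 = set(doc1.split()), set(doc2.split())
    return 100 * len(words1 & words2) / len(words1)


single = safe_pairwise_mean(["the old man is here"], shared_word_percent)
pair = safe_pairwise_mean(
    ["the old man is here", "the old girl is here"], shared_word_percent
)
print(single)  # nan: a one-sentence group has no pairs
print(pair)    # 80.0: four of the five distinct words are shared
```

With pandas, the same guard could presumably be applied per group through something like df.groupby('First')['Second'].apply(lambda s: safe_pairwise_mean(s, fuzz.token_set_ratio)).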

