pandas: calculate the fuzzywuzzy score separately for each category

I have a dataset like the following, only with more rows:

import pandas as pd

data = {'First':  ['First value','Third value','Second value','First value','Third value','Second value'],
'Second': ['the old man is here','the young girl is there', 'the old woman is here','the  young boy is there','the young girl is here','the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])

I computed the average fuzzywuzzy score over the whole dataset as follows:

from itertools import combinations

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def similarity_measure(doc1, doc2): 
    return fuzz.token_set_ratio(doc1, doc2)


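# Join all sentences of each category into a single string
d = df.groupby('First')['Second'].apply(lambda x: ', '.join(x))
d = d.reset_index()

# Score every pair of categories against each other
scores = []  # renamed from `all` to avoid shadowing the built-in
for i, j in combinations(range(len(d)), 2):
    scores.append(similarity_measure(d.iloc[i, 1], d.iloc[j, 1]))

avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)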

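However, I would also like to get the average separately for each category in the first column. So I want something like this (for example):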

similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01

So I would like to modify the for loop to get the output above.

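To compute the average within each group, you need two steps:

  1. Group by some criterion, in your case the column First. It seems you already know how to do this.
  2. Create a function that computes the similarity within a group. This is the all_similarity_measure function in the code below.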

Code

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations


def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)


data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the  young boy is there',
                   'the young girl is here', 'the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])


def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()


res = df.groupby('First', as_index=False)['Second'].apply(all_similarity_measure)
print(res)

Output

          First  Second
0   First value    63.0
1  Second value    86.0
2   Third value    98.0
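
The question asked for one line per category rather than a table. A minimal sketch of that formatting step, assuming res has the columns First and Second as printed above (it is rebuilt by hand here so the snippet runs on its own; the two-decimal format is a choice made for this example):

```python
import pandas as pd

# Hand-built copy of the result printed above, so this snippet is self-contained
res = pd.DataFrame({'First': ['First value', 'Second value', 'Third value'],
                    'Second': [63.0, 86.0, 98.0]})

# One formatted line per category, in the style requested in the question
lines = [f"similarity average for sentences in {category}: {average:.2f}"
         for category, average in zip(res['First'], res['Second'])]

for line in lines:
    print(line)
```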

The key to computing the average similarity is this expression:

return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()

Basically, you generate the sentence pairs with combinations (no need to access by index), construct a Series, and compute its mean.
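
To see what combinations produces on its own, here is a small illustration using three of the sentences from the question (the list name sentences is just for this example):

```python
from itertools import combinations

sentences = ["the old man is here", "the old woman is here", "the old girl is here"]

# combinations(iterable, 2) yields each unordered pair exactly once,
# following the order of the input; no pair is repeated or reversed.
pairs = list(combinations(sentences, 2))
for first, second in pairs:
    print(first, "|", second)

# n items give n * (n - 1) / 2 pairs, so 3 sentences give 3 pairs
print(len(pairs))
```

Iterating over a pandas Series yields its values, which is why combinations(gdf, 2) can be applied to the group directly.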

Any function that computes a mean can be used in place of the one above. For example, you could use statistics.mean to avoid constructing a Series.

from statistics import mean

def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return mean(similarity_measure(*docs) for docs in combinations(gdf, 2))
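
One difference between the two variants is worth keeping in mind: a category with a single sentence produces no pairs. statistics.mean raises StatisticsError on empty input, whereas the mean of an empty Series is NaN. Below is a minimal sketch of a guard for that case. The names average_pairwise and word_overlap are hypothetical, and word_overlap is only a standard-library stand-in scorer so the snippet runs without fuzzywuzzy; it is not equivalent to fuzz.token_set_ratio.

```python
import math
from itertools import combinations
from statistics import mean


def word_overlap(a, b):
    """Hypothetical stand-in scorer: percentage of shared words (Jaccard index * 100)."""
    words_a, words_b = set(a.split()), set(b.split())
    return 100 * len(words_a & words_b) / len(words_a | words_b)


def average_pairwise(docs, scorer):
    """Mean of scorer(a, b) over all unordered pairs; NaN when there are no pairs."""
    scores = [scorer(a, b) for a, b in combinations(docs, 2)]
    return mean(scores) if scores else float("nan")


# Two sentences give one pair, so the average is just that pair's score
print(average_pairwise(["the old woman is here", "the old girl is here"], word_overlap))

# A single sentence gives no pairs; the guard returns NaN instead of raising
print(math.isnan(average_pairwise(["the old man is here"], word_overlap)))
```

With fuzzywuzzy installed, passing fuzz.token_set_ratio as the scorer, for example via lambda g: average_pairwise(g, fuzz.token_set_ratio) in the groupby apply call, should give the same averages as all_similarity_measure.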