Obtain a column average of only those rows with a fuzzy ratio above a given value

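For each value in a grouped pandas column, I am trying to get the average of another column containing 1s and 0s, using only the rows whose fuzz.partial_ratio() against that value exceeds a given limit (say, above 80).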

I know this may be unclear, so here is a sample of my data:

col1 col2      col3
A    Miami       1
A    Miami       0
A    Miami.      0
A    Barcelona   0
A    Barc elona  0
A    Shanghai    1
A    Shangai     0
B    Miami       1
B    Miami       1
B    Miami.      1
B    Barcelona   0
B    Barc elona  0
B    Shanghai    1
B    Shangai     0

I am trying to groupby('col1') and, for each value in col2, compute in a new column the mean of col3 over the rows whose fuzzy ratio against that col2 value is above 80.

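For example, in row 0, df['col2']='Miami'. I would then like to compute the fuzzy ratio of 'Miami' against every col2 value where df['col1']='A', take the mean of col3 over the rows whose ratio is above 80, and store it in a new column. That would be rows 1 and 2, giving 0. The same goes for row 2, but in that case the average would be 0.5.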

The output I am trying to get looks like this:

col1 col2      col3 col4
A    Miami       1   0.33
A    Miami       0   0.33
A    Miami.      0   0.33
A    Barcelona   0   0
A    Barc elona  0   0
A    Shanghai    1   0.5
A    Shangai     0   0.5
B    Miami       1   1
B    Miami       1   1
B    Miami.      1   1
B    Barcelona   0   0
B    Barc elona  0   0
B    Shanghai    1   0.5
B    Shangai     0   0.5

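I managed to do this with a for loop over each value in col2, but I have a relatively large dataset (more than 10 million rows), so this would take far too long.

Is there a way to perform this task without a for loop?

Thank you very much!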

This is not efficient, but I think it does what you need:

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

# helper function
def remove_element(lst, index):
    "Removes an element from a list based on the index"
    tmp = lst.copy()
    del tmp[index]
    return tmp


df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'A', 'A', 'A', 
                           'B', 'B', 'B', 'B', 'B', 'B', 'B'], 
                  'col2':['Miami', 'Miami', 'Miami.', 'Barcelona', 'Barc elona', 
                         'Shanghai', 'Shangai', 'Miami', 'Miami', 'Miami.', 
                         'Barcelona', 'Barc elona', 'Shanghai', 'Shangai'], 
                  'col3':[1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

# position of each row within its col1 group (0-based)
df['col2_index'] = df.groupby('col1').cumcount()

# create a list of the elements within the group
df['col2_list'] = df['col1'].map(df.groupby('col1')['col2'].apply(list))
df['col3_list'] = df['col1'].map(df.groupby('col1')['col3'].apply(list))

# remove the element associated with col2 and col3 respectively
df['col2_list'] = df.apply(lambda x: remove_element(x['col2_list'], x['col2_index']), axis=1)
df['col3_list'] = df.apply(lambda x: remove_element(x['col3_list'], x['col2_index']), axis=1)

# apply the threshold of 80 for the partial_ratio
df['key'] = df.apply(lambda x: 
         np.array([fuzz.partial_ratio(x['col2'], el) for el in x['col2_list']]) >= 80, axis=1)

# get the average of col3 for those that pass the threshold
df['result'] = df.apply(lambda x: np.mean(np.array(x['col3_list'])[x['key']]), axis=1)

df

 col1   col2       col3 col2_index  col2_list                                           col3_list           key                                      result
0   A   Miami       1   0           [Miami, Miami., Barcelona, Barc elona, Shangha...   [0, 0, 0, 0, 1, 0]  [True, True, False, False, False, False]    0.0
1   A   Miami       0   1           [Miami, Miami., Barcelona, Barc elona, Shangha...   [1, 0, 0, 0, 1, 0]  [True, True, False, False, False, False]    0.5
2   A   Miami.      0   2           [Miami, Miami, Barcelona, Barc elona, Shanghai...   [1, 0, 0, 0, 1, 0]  [True, True, False, False, False, False]    0.5
3   A   Barcelona   0   3           [Miami, Miami, Miami., Barc elona, Shanghai, S...   [1, 0, 0, 0, 1, 0]  [False, False, False, True, False, False]   0.0
4   A   Barc elona  0   4           [Miami, Miami, Miami., Barcelona, Shanghai, Sh...   [1, 0, 0, 0, 1, 0]  [False, False, False, True, False, False]   0.0
5   A   Shanghai    1   5           [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 0]  [False, False, False, False, False, True]   0.0
6   A   Shangai     0   6           [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1]  [False, False, False, False, False, True]   1.0
7   B   Miami       1   0           [Miami, Miami., Barcelona, Barc elona, Shangha...   [1, 1, 0, 0, 1, 0]  [True, True, False, False, False, False]    1.0
8   B   Miami       1   1           [Miami, Miami., Barcelona, Barc elona, Shangha...   [1, 1, 0, 0, 1, 0]  [True, True, False, False, False, False]    1.0
9   B   Miami.      1   2           [Miami, Miami, Barcelona, Barc elona, Shanghai...   [1, 1, 0, 0, 1, 0]  [True, True, False, False, False, False]    1.0
10  B   Barcelona   0   3           [Miami, Miami, Miami., Barc elona, Shanghai, S...   [1, 1, 1, 0, 1, 0]  [False, False, False, True, False, False]   0.0
11  B   Barc elona  0   4           [Miami, Miami, Miami., Barcelona, Shanghai, Sh...   [1, 1, 1, 0, 1, 0]  [False, False, False, True, False, False]   0.0
12  B   Shanghai    1   5           [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 0]  [False, False, False, False, False, True]   0.0
13  B   Shangai     0   6           [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1]  [False, False, False, False, False, True]   1.0

For the updated question, simply drop the part of the code that removes each row's own element from the lists:

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'A', 'A', 'A', 
                           'B', 'B', 'B', 'B', 'B', 'B', 'B'], 
                  'col2':['Miami', 'Miami', 'Miami.', 'Barcelona', 'Barc elona', 
                         'Shanghai', 'Shangai', 'Miami', 'Miami', 'Miami.', 
                         'Barcelona', 'Barc elona', 'Shanghai', 'Shangai'], 
                  'col3':[1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})


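# per-row lists of every col2 / col3 value in the same col1 group (the row itself included)
df['col2_list'] = df['col1'].map(df.groupby('col1')['col2'].apply(list))
df['col3_list'] = df['col1'].map(df.groupby('col1')['col3'].apply(list))

# boolean mask: group members whose partial_ratio against this row's col2 is at least 80
df['key'] = df.apply(lambda x: 
         np.array([fuzz.partial_ratio(x['col2'], el) for el in x['col2_list']]) >= 80, axis=1)

# mean of col3 over the group members that pass the threshold
df['result'] = df.apply(lambda x: np.mean(np.array(x['col3_list'])[x['key']]), axis=1)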

df

  col1  col2     col3   col2_list                                           col3_list               key                                             result
0   A   Miami       1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  0.333333
1   A   Miami       0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  0.333333
2   A   Miami.      0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  0.333333
3   A   Barcelona   0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [False, False, False, True, True, False, False] 0.000000
4   A   Barc elona  0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [False, False, False, True, True, False, False] 0.000000
5   A   Shanghai    1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [False, False, False, False, False, True, True] 0.500000
6   A   Shangai     0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 0, 0, 0, 0, 1, 0]   [False, False, False, False, False, True, True] 0.500000
7   B   Miami       1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  1.000000
8   B   Miami       1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  1.000000
9   B   Miami.      1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [True, True, True, False, False, False, False]  1.000000
10  B   Barcelona   0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [False, False, False, True, True, False, False] 0.000000
11  B   Barc elona  0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [False, False, False, True, True, False, False] 0.000000
12  B   Shanghai    1   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [False, False, False, False, False, True, True] 0.500000
13  B   Shangai     0   [Miami, Miami, Miami., Barcelona, Barc elona, ...   [1, 1, 1, 0, 0, 1, 0]   [False, False, False, False, False, True, True] 0.500000
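
Since the code above scores every row against every other row in its group, it may not scale to 10 million rows. One possible way to cut the work, sketched below under some assumptions, is to aggregate col3 once per distinct (col1, col2) pair and run the fuzzy comparison only between distinct strings. The names fuzzy_group_means and simple_ratio are hypothetical, and simple_ratio is a standard-library stand-in built on difflib, not fuzz.partial_ratio, so its scores will differ somewhat from fuzzywuzzy's. Like the second version above, each row counts toward its own average.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (an assumption, NOT fuzz.partial_ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def fuzzy_group_means(rows, scorer=simple_ratio, threshold=80):
    """rows: iterable of (group, text, value) tuples.

    Returns {(group, text): mean of value over all rows in the same group
    whose text scores >= threshold against text}.
    """
    # Aggregate once per distinct (group, text): [sum of values, row count].
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, text, value in rows:
        cell = stats[group][text]
        cell[0] += value
        cell[1] += 1

    means = {}
    for group, by_text in stats.items():
        texts = list(by_text)
        # Pairwise scoring over distinct strings only, not over every row.
        for a in texts:
            total = count = 0
            for b in texts:
                if scorer(a, b) >= threshold:
                    total += by_text[b][0]
                    count += by_text[b][1]
            means[(group, a)] = total / count if count else float('nan')
    return means


# Sample data from the question as (col1, col2, col3) tuples.
col2 = ['Miami', 'Miami', 'Miami.', 'Barcelona', 'Barc elona', 'Shanghai', 'Shangai']
rows = (list(zip('A' * 7, col2, [1, 0, 0, 0, 0, 1, 0]))
        + list(zip('B' * 7, col2, [1, 1, 1, 0, 0, 1, 0])))

means = fuzzy_group_means(rows)
```

To use the fuzzywuzzy scorer instead, pass scorer=fuzz.partial_ratio. With the DataFrame above, the result could then be attached with something like df['col4'] = [means[k] for k in zip(df['col1'], df['col2'])]. The cost now depends on the number of distinct strings per group rather than the number of rows, so the gain is large only if col2 values repeat often.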