Obtain a column average of only those rows with a fuzzy ratio above a given value
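I am trying to obtain, for each value in a grouped pandas column, the average of another column containing 1s and 0s, using only the rows whose fuzz.partial_ratio() against that value exceeds a given limit (say, above 80).
I know this may be unclear, so here is a sample of my data: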
col1 col2 col3
A Miami 1
A Miami 0
A Miami. 0
A Barcelona 0
A Barc elona 0
A Shanghai 1
A Shangai 0
B Miami 1
B Miami 1
B Miami. 1
B Barcelona 0
B Barc elona 0
B Shanghai 1
B Shangai 0
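I am trying to groupby('col1') and, for each value in col2, estimate in a new column the average of col3 over only those rows whose fuzzy ratio against that col2 value is above 80.
For example, in row 0, df['col2']='Miami'. I would like to compute the fuzzy_ratio() of 'Miami' against every value in col2 where df['col1']='A', take the average of col3 over the rows with a ratio > 80, and store it in a new column. Those would be rows 1 and 2, giving 0. The same applies to row 2, but in that case the average would be 0.5.
The output I am trying to obtain looks like this: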
col1 col2 col3 col4
A Miami 1 0.33
A Miami 0 0.33
A Miami. 0 0.33
A Barcelona 0 0
A Barc elona 0 0
A Shanghai 1 0.5
A Shangai 0 0.5
B Miami 1 1
B Miami 1 1
B Miami. 1 1
B Barcelona 0 0
B Barc elona 0 0
B Shanghai 1 0.5
B Shangai 0 0.5
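I managed to do this with a for loop over each value in col2, but I have a relatively large dataset (10+ million rows), so it would take a very long time.
Is there any way to avoid the for loop for this task?
Many thanks!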
It's not efficient, but I think it does what you need:
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
# helper function
def remove_element(lst, index):
    "Removes an element from a list based on the index"
    tmp = lst.copy()
    del tmp[index]
    return tmp
df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A',
                            'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'col2': ['Miami', 'Miami', 'Miami.', 'Barcelona', 'Barc elona',
                            'Shanghai', 'Shangai', 'Miami', 'Miami', 'Miami.',
                            'Barcelona', 'Barc elona', 'Shanghai', 'Shangai'],
                   'col3': [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
# create a column that indicates the index of the element within the group
df['col2_index'] = 1
df['col2_index'] = df.groupby('col1')['col2_index'].cumsum() - 1
# create a list of the elements within the group
df['col2_list'] = df['col1'].map(df.groupby('col1')['col2'].apply(list))
df['col3_list'] = df['col1'].map(df.groupby('col1')['col3'].apply(list))
# remove the element associated with col2 and col3 respectively
df['col2_list'] = df.apply(lambda x: remove_element(x['col2_list'], x['col2_index']), axis=1)
df['col3_list'] = df.apply(lambda x: remove_element(x['col3_list'], x['col2_index']), axis=1)
# apply the threshold of 80 for the partial_ratio
df['key'] = df.apply(lambda x:
    np.array([fuzz.partial_ratio(x['col2'], el) for el in x['col2_list']]) >= 80, axis=1)
# get the average of col3 for those that pass the threshold
df['result'] = df.apply(lambda x: np.mean(np.array(x['col3_list'])[x['key']]), axis=1)
df
col1 col2 col3 col2_index col2_list col3_list key result
0 A Miami 1 0 [Miami, Miami., Barcelona, Barc elona, Shangha... [0, 0, 0, 0, 1, 0] [True, True, False, False, False, False] 0.0
1 A Miami 0 1 [Miami, Miami., Barcelona, Barc elona, Shangha... [1, 0, 0, 0, 1, 0] [True, True, False, False, False, False] 0.5
2 A Miami. 0 2 [Miami, Miami, Barcelona, Barc elona, Shanghai... [1, 0, 0, 0, 1, 0] [True, True, False, False, False, False] 0.5
3 A Barcelona 0 3 [Miami, Miami, Miami., Barc elona, Shanghai, S... [1, 0, 0, 0, 1, 0] [False, False, False, True, False, False] 0.0
4 A Barc elona 0 4 [Miami, Miami, Miami., Barcelona, Shanghai, Sh... [1, 0, 0, 0, 1, 0] [False, False, False, True, False, False] 0.0
5 A Shanghai 1 5 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 0] [False, False, False, False, False, True] 0.0
6 A Shangai 0 6 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1] [False, False, False, False, False, True] 1.0
7 B Miami 1 0 [Miami, Miami., Barcelona, Barc elona, Shangha... [1, 1, 0, 0, 1, 0] [True, True, False, False, False, False] 1.0
8 B Miami 1 1 [Miami, Miami., Barcelona, Barc elona, Shangha... [1, 1, 0, 0, 1, 0] [True, True, False, False, False, False] 1.0
9 B Miami. 1 2 [Miami, Miami, Barcelona, Barc elona, Shanghai... [1, 1, 0, 0, 1, 0] [True, True, False, False, False, False] 1.0
10 B Barcelona 0 3 [Miami, Miami, Miami., Barc elona, Shanghai, S... [1, 1, 1, 0, 1, 0] [False, False, False, True, False, False] 0.0
11 B Barc elona 0 4 [Miami, Miami, Miami., Barcelona, Shanghai, Sh... [1, 1, 1, 0, 1, 0] [False, False, False, True, False, False] 0.0
12 B Shanghai 1 5 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 0] [False, False, False, False, False, True] 0.0
13 B Shangai 0 6 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1] [False, False, False, False, False, True] 1.0
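The per-row exclusion above can also be expressed without copying the lists, by skipping the row's own position while collecting matches. The following is only a minimal sketch in plain Python: `score` and `leave_one_out_mean` are hypothetical names, and `difflib.SequenceMatcher` is a stand-in so the example runs without third-party packages. Its scores are not the same as those of `fuzz.partial_ratio`, although on this sample both put the same rows above 80. The loop is still quadratic in the number of rows, so it illustrates the logic rather than a fast solution. It also returns NaN explicitly when a row has no other match in its group.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in similarity on a 0-100 scale (assumption: NOT fuzz.partial_ratio).
    return 100 * SequenceMatcher(None, a, b).ratio()


def leave_one_out_mean(groups, names, values, threshold=80):
    """Average `values` over the OTHER rows of the same group whose name
    scores at least `threshold` against the current row's name."""
    result = []
    for i, (g, name) in enumerate(zip(groups, names)):
        matched = [
            values[j]
            for j in range(len(names))
            if j != i and groups[j] == g and score(name, names[j]) >= threshold
        ]
        # No other matching row: report NaN instead of dividing by zero.
        result.append(sum(matched) / len(matched) if matched else float("nan"))
    return result


col1 = ["A"] * 7 + ["B"] * 7
col2 = ["Miami", "Miami", "Miami.", "Barcelona", "Barc elona",
        "Shanghai", "Shangai"] * 2
col3 = [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

loo = leave_one_out_mean(col1, col2, col3)
print(loo)
```

On the sample data this reproduces the `result` column shown above (0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0 for group A).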
For the updated question, simply remove the part of the code that drops each row's own element from the lists:
df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A',
                            'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'col2': ['Miami', 'Miami', 'Miami.', 'Barcelona', 'Barc elona',
                            'Shanghai', 'Shangai', 'Miami', 'Miami', 'Miami.',
                            'Barcelona', 'Barc elona', 'Shanghai', 'Shangai'],
                   'col3': [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
df['col2_list'] = df['col1'].map(df.groupby('col1')['col2'].apply(list))
df['col3_list'] = df['col1'].map(df.groupby('col1')['col3'].apply(list))
df['key'] = df.apply(lambda x:
    np.array([fuzz.partial_ratio(x['col2'], el) for el in x['col2_list']]) >= 80, axis=1)
df['result'] = df.apply(lambda x: np.mean(np.array(x['col3_list'])[x['key']]), axis=1)
df
col1 col2 col3 col2_list col3_list key result
0 A Miami 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [True, True, True, False, False, False, False] 0.333333
1 A Miami 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [True, True, True, False, False, False, False] 0.333333
2 A Miami. 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [True, True, True, False, False, False, False] 0.333333
3 A Barcelona 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [False, False, False, True, True, False, False] 0.000000
4 A Barc elona 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [False, False, False, True, True, False, False] 0.000000
5 A Shanghai 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [False, False, False, False, False, True, True] 0.500000
6 A Shangai 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 0, 0, 0, 0, 1, 0] [False, False, False, False, False, True, True] 0.500000
7 B Miami 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [True, True, True, False, False, False, False] 1.000000
8 B Miami 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [True, True, True, False, False, False, False] 1.000000
9 B Miami. 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [True, True, True, False, False, False, False] 1.000000
10 B Barcelona 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [False, False, False, True, True, False, False] 0.000000
11 B Barc elona 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [False, False, False, True, True, False, False] 0.000000
12 B Shanghai 1 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [False, False, False, False, False, True, True] 0.500000
13 B Shangai 0 [Miami, Miami, Miami., Barcelona, Barc elona, ... [1, 1, 1, 0, 0, 1, 0] [False, False, False, False, False, True, True] 0.500000
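Since the question mentions 10+ million rows, one possible way to cut the work is to score each distinct col2 value only once per group rather than once per row, then broadcast the result back to the rows. The sketch below assumes this is acceptable for the data (many repeated names per group); `fuzzy_group_mean` and `difflib_score` are hypothetical names, not part of pandas or fuzzywuzzy. The default scorer uses `difflib.SequenceMatcher` purely as a stand-in so the example runs with the standard library, and its scores differ from `fuzz.partial_ratio`. Passing `scorer=fuzz.partial_ratio` should apply the same matching rule as the answer.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def difflib_score(a, b):
    # Stand-in similarity on a 0-100 scale (assumption: NOT fuzz.partial_ratio).
    return 100 * SequenceMatcher(None, a, b).ratio()


def fuzzy_group_mean(groups, names, values, scorer=difflib_score, threshold=80):
    """For each row, average `values` over all rows of the same group (the row
    itself included) whose name scores at least `threshold` against it."""
    # Per group and per distinct name, accumulate [sum of values, row count].
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for g, name, v in zip(groups, names, values):
        cell = totals[g][name]
        cell[0] += v
        cell[1] += 1

    # Compare each distinct name only with the distinct names of its own group.
    mean_for = {}
    for g, per_name in totals.items():
        for name in per_name:
            total = count = 0
            for other, (s, c) in per_name.items():
                if scorer(name, other) >= threshold:
                    total += s
                    count += c
            mean_for[(g, name)] = total / count if count else float("nan")

    # Broadcast the per-name result back to every row, in the original order.
    return [mean_for[(g, name)] for g, name in zip(groups, names)]


col1 = ["A"] * 7 + ["B"] * 7
col2 = ["Miami", "Miami", "Miami.", "Barcelona", "Barc elona",
        "Shanghai", "Shangai"] * 2
col3 = [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

col4 = fuzzy_group_mean(col1, col2, col3)
print([round(x, 2) for x in col4])
# [0.33, 0.33, 0.33, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.0, 0.0, 0.5, 0.5]
```

With a DataFrame, the returned list could be assigned directly, for example `df['col4'] = fuzzy_group_mean(df['col1'], df['col2'], df['col3'])`. The number of scorer calls then depends on the number of distinct names per group rather than on the number of rows, which only helps when names repeat often.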