Creating a new column based on the mean of other values in group
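I am trying to compute the mean of the other values in a group while excluding the focal company. I know this is a bit complicated, so let me explain.
For example, suppose the code below is my data: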
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
     'col2': [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017],
     'col3': [10, 20, 25, 10, 12, 14, 8, 9, 10, 50, 60, 70, 40, 50, 60],
     'group': [10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20]}
df = pd.DataFrame(d)
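I want to take the mean of (B + C) for 2015, computed within df.group, and put it in a new column on the A 2016 row. In other words, for each row we need the mean of col3 over its df.group for the previous year, excluding the focal item.
The result should look like this: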
d = {'col1': ["A", "A", "A", "B", "B", "B", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
     'col2': [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017],
     'col3': [10, 20, 25, 10, 12, 14, 8, 9, 10, 50, 60, 70, 40, 50, 60],
     'group': [10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20],
     'operation': ['0', '(B2015+C2015)/2', '(B2016+C2016)/2', '0', '(A2015+C2015)/2', '(A2016+C2016)/2', '0', '(A2015+B2015)/2', '(A2016+B2016)/2', '0', 'E2015', 'E2016', '0', 'D2015', 'D2016'],
     'mean': [0, 9, 10.5, 0, 9, 14.5, 0, 10, 16, 0, 40, 50, 0, 50, 60]}
output = pd.DataFrame(d)
>>> output
    col1  col2  col3  group        operation  mean
 0     A  2015    10     10                0   0.0
 1     A  2016    20     10  (B2015+C2015)/2   9.0
 2     A  2017    25     10  (B2016+C2016)/2  10.5
 3     B  2015    10     10                0   0.0
 4     B  2016    12     10  (A2015+C2015)/2   9.0
 5     B  2017    14     10  (A2016+C2016)/2  14.5
 6     c  2015     8     10                0   0.0
 7     c  2016     9     10  (A2015+B2015)/2  10.0
 8     c  2017    10     10  (A2016+B2016)/2  16.0
 9     d  2015    50     20                0   0.0
10     d  2016    60     20            E2015  40.0
11     d  2017    70     20            E2016  50.0
12     e  2015    40     20                0   0.0
13     e  2016    50     20            D2015  50.0
14     e  2017    60     20            D2016  60.0
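- Use a double groupby to compute the mean of all other values within each group:
  - sum all the values within the group
  - subtract the current (focal) value
  - divide by the number of items in the group minus one
- Assign the shift-ed means to a new column: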
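# Per group and year: (sum of col3 - own col3) / (number of companies in the group - 1)
means = df.groupby("group").apply(lambda x: x.groupby("col2")["col3"].transform("sum").sub(x["col3"]).div(x["col1"].nunique() - 1)).droplevel(0)
# Shift down one row so each row gets the previous year's value; the first row of each company becomes 0
df["mean"] = means.shift().where(df["col1"].eq(df["col1"].shift()), 0)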
>>> df
    col1  col2  col3  group  mean
 0     A  2015    10     10   0.0
 1     A  2016    20     10   9.0
 2     A  2017    25     10  10.5
 3     B  2015    10     10   0.0
 4     B  2016    12     10   9.0
 5     B  2017    14     10  14.5
 6     c  2015     8     10   0.0
 7     c  2016     9     10  10.0
 8     c  2017    10     10  16.0
 9     d  2015    50     20   0.0
10     d  2016    60     20  40.0
11     d  2017    70     20  50.0
12     e  2015    40     20   0.0
13     e  2016    50     20  50.0
14     e  2017    60     20  60.0
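For comparison, here is a minimal sketch of the same sum, subtract, divide, shift idea without groupby.apply. It is not the code from the answer above, only one possible variant. It assumes each company appears at most once per year and that rows are sorted by year within each company; the names by_year and others_mean are made up for illustration.

```python
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
     'col2': [2015, 2016, 2017] * 5,
     'col3': [10, 20, 25, 10, 12, 14, 8, 9, 10, 50, 60, 70, 40, 50, 60],
     'group': [10] * 9 + [20] * 6}
df = pd.DataFrame(d)

# Group col3 by (group, year) so that sum and count are aligned with df's rows.
by_year = df.groupby(["group", "col2"])["col3"]

# Leave-one-out mean: (sum of the year's values in the group - own value) / (count - 1).
# Assumption: one row per company per year, so count equals the number of companies.
others_mean = (by_year.transform("sum") - df["col3"]) / (by_year.transform("count") - 1)

# Shift within each company so a row receives the previous year's value.
# The first year of each company has no previous year and is filled with 0.
df["mean"] = others_mean.groupby(df["col1"]).shift().fillna(0)

print(df)
```

Shifting per company (rather than shifting the whole column and masking) avoids comparing col1 with its own shifted copy. Note that fillna(0) would also turn the 0/0 result of a single-company group into 0, which may or may not be what you want.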