Groupby & Sum - Create new column with added If Condition
I have the following DataFrame:
ID Start End Variance
1 100000 120000 20000
1 1 0 -1
1 7815.58 7815.58 0
1 5261 5261 0
1 138783.2 89969.37 -48813.83
1 2459.92 2459.92 0
2 101421.99 93387.45 -8034.54
2 940.04 940.04 0
2 63.06 63.06 0
2 2454.86 2454.86 0
2 830 830 0
2 299 299 0
2 14000 12000 2000
2 1500 500 1000
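I want to create a new column, Overspend Total, that holds the sum of Variance for each ID, but I only want to sum the values greater than 0. The resulting DataFrame would look like this: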
ID Start End Variance Overspend Total
1 100000 120000 20000 20000
1 1 0 -1 20000
1 7815.58 7815.58 0 20000
1 5261 5261 0 20000
1 138783.2 89969.37 -48813.83 20000
1 2459.92 2459.92 0 20000
2 101421.99 93387.45 -8034.54 3000
2 940.04 940.04 0 3000
2 63.06 63.06 0 3000
2 2454.86 2454.86 0 3000
2 830 830 0 3000
2 299 299 0 3000
2 14000 12000 2000 3000
2 1500 500 1000 3000
I tried the following:
df['Overspend Variance'] = df[df['Variance'] > 0].groupby(df['ID']).transform('sum')
But I get the following error:
ValueError: Wrong number of items passed 8, placement implies 1
I know that df['Overspend Variance'] = df['Variance'].groupby(df['ID']).transform('sum')
works without the condition, but I don't know how to combine it with the extra condition.
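This can be done by filtering out the values less than 0, then grouping by ID and joining the per-group sums back onto the original DataFrame: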
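# Pass "sum" as a string; passing the builtin sum to agg triggers a FutureWarning in recent pandas versions.
df = df.join(df[df.Variance>=0].groupby("ID")["Variance"].agg("sum"), on="ID", rsuffix="total")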
df.columns = ["ID", "Start", "End", "Variance", "Overspend Total"]
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
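The filter, group, and reassign idea can also be sketched without the join and the column rename, by mapping each row's ID to its group total. This is only an illustrative variant, not the answerer's code: the small frame below is made-up data that reuses the question's column names, and `positive_totals` is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample (not the question's full data) with the same column names.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Variance": [20000.0, -1.0, 0.0, -8034.54, 2000.0, 1000.0],
})

# Keep only the positive variances, then sum them per ID.
# The result is a Series indexed by ID with one total per group.
positive_totals = df.loc[df["Variance"] > 0].groupby("ID")["Variance"].sum()

# Look up each row's ID in the per-group totals to broadcast them back.
# An ID with no positive rows is absent from positive_totals, so fill it with 0.
df["Overspend Total"] = df["ID"].map(positive_totals).fillna(0)

print(df)
```

The `fillna(0)` step matters only when some ID has no positive Variance at all. Without it, those rows would hold NaN rather than 0.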
We can use Series.where to replace the values that don't match the condition with NaN, then just groupby transform 'sum', because NaN values are ignored by 'sum' by default:
df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
Or explicitly replace them with the additive identity (0), which does not affect the sum:
df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0, 0)
    .groupby(df['ID']).transform('sum')
)
Or use a lambda inside groupby transform:
df['Overspend Total'] = df.groupby('ID')['Variance'].transform(
    lambda s: s[s > 0].sum()
)
Either way, df is:
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
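To check that the three variants agree, here is a minimal self-contained sketch that rebuilds the question's data by hand and computes the column each way. The names `positive`, `via_nan`, `via_zero`, and `via_lambda` are hypothetical and used only for this comparison.

```python
import pandas as pd

# The question's data, typed in by hand.
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 8,
    "Start": [100000, 1, 7815.58, 5261, 138783.2, 2459.92,
              101421.99, 940.04, 63.06, 2454.86, 830, 299, 14000, 1500],
    "End": [120000, 0, 7815.58, 5261, 89969.37, 2459.92,
            93387.45, 940.04, 63.06, 2454.86, 830, 299, 12000, 500],
    "Variance": [20000, -1, 0, 0, -48813.83, 0,
                 -8034.54, 0, 0, 0, 0, 0, 2000, 1000],
})

# Boolean mask of the rows that should contribute to the total.
positive = df["Variance"] > 0

# 1) Replace non-matching rows with NaN; 'sum' skips NaN by default.
via_nan = df["Variance"].where(positive).groupby(df["ID"]).transform("sum")

# 2) Replace non-matching rows with 0, which leaves the sum unchanged.
via_zero = df["Variance"].where(positive, 0).groupby(df["ID"]).transform("sum")

# 3) Filter inside each group; the scalar result is broadcast to the group's rows.
via_lambda = df.groupby("ID")["Variance"].transform(lambda s: s[s > 0].sum())

df["Overspend Total"] = via_nan
print(df)
```

The first two variants stay on the built-in 'sum' path, while the lambda calls a Python function once per group. On a frame with many groups, the lambda version would likely be the slowest of the three.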