Groupby & Sum - Create new column with added If Condition

I have the following DataFrame:

ID  Start           End              Variance
1   100000          120000           20000
1   1               0                -1
1   7815.58         7815.58          0
1   5261            5261             0
1   138783.2        89969.37         -48813.83
1   2459.92         2459.92          0
2   101421.99       93387.45         -8034.54
2   940.04          940.04           0
2   63.06           63.06            0
2   2454.86         2454.86          0
2   830             830              0
2   299             299              0
2   14000           12000            2000
2   1500            500              1000
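
For anyone who wants to reproduce this, the frame above can be built roughly as follows. The values are copied from the table; the variable name `df` is chosen to match the code further down.

```python
import pandas as pd

# Values copied row by row from the table above
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 8,
    "Start": [100000, 1, 7815.58, 5261, 138783.2, 2459.92,
              101421.99, 940.04, 63.06, 2454.86, 830, 299, 14000, 1500],
    "End": [120000, 0, 7815.58, 5261, 89969.37, 2459.92,
            93387.45, 940.04, 63.06, 2454.86, 830, 299, 12000, 500],
    "Variance": [20000, -1, 0, 0, -48813.83, 0,
                 -8034.54, 0, 0, 0, 0, 0, 2000, 1000],
})
print(df)
```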


I want to create a new column, Overspend Total, but I only want to sum the values greater than 0. The resulting DataFrame would look like this:

ID  Start           End              Variance        Overspend Total
1   100000          120000           20000           20000
1   1               0                -1              20000
1   7815.58         7815.58          0               20000
1   5261            5261             0               20000
1   138783.2        89969.37         -48813.83       20000
1   2459.92         2459.92          0               20000
2   101421.99       93387.45         -8034.54        3000
2   940.04          940.04           0               3000
2   63.06           63.06            0               3000
2   2454.86         2454.86          0               3000
2   830             830              0               3000
2   299             299              0               3000
2   14000           12000            2000            3000
2   1500            500              1000            3000

I tried the following:

df['Overspend Variance'] = df[df['Variance'] > 0].groupby(df['ID']).transform('sum')

But I get the following error:

ValueError: Wrong number of items passed 8, placement implies 1

I know that df['Overspend Variance'] = df['Variance'].groupby(df['ID']).transform('sum') works without the condition, but I don't know how to combine it with the extra condition.
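
A likely reading of the error, sketched on a small made-up frame (the values below are hypothetical, only the column names come from the question): filtering the whole frame with df[df['Variance'] > 0] keeps every column, so transform('sum') returns a DataFrame with several columns, and that cannot be stored in a single column. The exact error message depends on the pandas version.

```python
import pandas as pd

# Small made-up frame with the same column names as in the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Start": [100.0, 50.0, 30.0, 20.0],
    "End": [120.0, 40.0, 30.0, 25.0],
    "Variance": [20.0, -10.0, 0.0, 5.0],
})

# Filtering the whole frame keeps every column, so transform returns a DataFrame
result = df[df["Variance"] > 0].groupby(df["ID"]).transform("sum")
print(type(result).__name__, result.shape)

# Selecting the single column first yields a Series instead.
# It only covers the rows that passed the filter, so assigning it
# back to df would leave NaN on all other rows.
positive = df.loc[df["Variance"] > 0, "Variance"]
as_series = positive.groupby(df["ID"]).transform("sum")
print(type(as_series).__name__, as_series.shape)
```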

This can be done by filtering out the values less than 0, then grouping by ID and joining the totals back onto the original frame:

# Sum the non-negative variances per ID, then join the totals back onto every row
totals = df[df.Variance >= 0].groupby("ID")["Variance"].sum().rename("Overspend Total")
df = df.join(totals, on="ID")

    ID      Start        End  Variance  Overspend Total
0    1  100000.00  120000.00  20000.00          20000.0
1    1       1.00       0.00     -1.00          20000.0
2    1    7815.58    7815.58      0.00          20000.0
3    1    5261.00    5261.00      0.00          20000.0
4    1  138783.20   89969.37 -48813.83          20000.0
5    1    2459.92    2459.92      0.00          20000.0
6    2  101421.99   93387.45  -8034.54           3000.0
7    2     940.04     940.04      0.00           3000.0
8    2      63.06      63.06      0.00           3000.0
9    2    2454.86    2454.86      0.00           3000.0
10   2     830.00     830.00      0.00           3000.0
11   2     299.00     299.00      0.00           3000.0
12   2   14000.00   12000.00   2000.00           3000.0
13   2    1500.00     500.00   1000.00           3000.0
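
One edge case with the filter-then-join approach: an ID that has no rows left after filtering is absent from the grouped totals, so the join leaves NaN for it. A minimal sketch on made-up data (the frame and the names `totals` and `out` are assumptions for illustration), using fillna(0) as one possible way to handle it:

```python
import pandas as pd

# Made-up data: ID 3 has no positive Variance at all
df = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Variance": [20.0, -1.0, 5.0, -4.0],
})

totals = (
    df[df["Variance"] > 0]
    .groupby("ID")["Variance"]
    .sum()
    .rename("Overspend Total")
)

# ID 3 is missing from totals, so the join leaves NaN there;
# fillna(0) turns that into an explicit zero total
out = df.join(totals, on="ID")
out["Overspend Total"] = out["Overspend Total"].fillna(0)
print(out)
```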

We can use Series.where to replace the values that don't match the condition with NaN, then just groupby transform 'sum', since 'sum' ignores NaN values by default:

df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
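
To see the two steps separately, here is a small sketch on made-up values (the frame and the names `masked` and `totals` are only for illustration). It also shows that a group with no positive value ends up with 0 rather than NaN, because the group sum uses min_count=0 by default:

```python
import pandas as pd

# Made-up values; ID 2 has no positive Variance
df = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Variance": [20.0, -1.0, 5.0, -3.0],
})

# Step 1: values that fail the condition become NaN
masked = df["Variance"].where(df["Variance"] > 0)
print(masked.tolist())

# Step 2: 'sum' skips NaN; a group holding only NaN sums to 0
totals = masked.groupby(df["ID"]).transform("sum")
print(totals.tolist())
```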

Or explicitly replace them with the additive identity (0), which does not affect the sum:

df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0, 0)
        .groupby(df['ID']).transform('sum')
)

Or use a lambda inside groupby transform:

df['Overspend Total'] = df.groupby('ID')['Variance'].transform(
    lambda s: s[s > 0].sum()
)

Either way, df is:

    ID      Start        End  Variance  Overspend Total
0    1  100000.00  120000.00  20000.00          20000.0
1    1       1.00       0.00     -1.00          20000.0
2    1    7815.58    7815.58      0.00          20000.0
3    1    5261.00    5261.00      0.00          20000.0
4    1  138783.20   89969.37 -48813.83          20000.0
5    1    2459.92    2459.92      0.00          20000.0
6    2  101421.99   93387.45  -8034.54           3000.0
7    2     940.04     940.04      0.00           3000.0
8    2      63.06      63.06      0.00           3000.0
9    2    2454.86    2454.86      0.00           3000.0
10   2     830.00     830.00      0.00           3000.0
11   2     299.00     299.00      0.00           3000.0
12   2   14000.00   12000.00   2000.00           3000.0
13   2    1500.00     500.00   1000.00           3000.0
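
As a quick sanity check, the three variants can be compared on a reduced copy of the data. This is only a sketch: the names `via_nan`, `via_zero` and `via_lambda` are made up for the comparison.

```python
import pandas as pd

# Reduced copy of the question's data
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Variance": [20000.0, -1.0, 0.0, -8034.54, 2000.0, 1000.0],
})

positive = df["Variance"] > 0
via_nan = df["Variance"].where(positive).groupby(df["ID"]).transform("sum")
via_zero = df["Variance"].where(positive, 0).groupby(df["ID"]).transform("sum")
via_lambda = df.groupby("ID")["Variance"].transform(lambda s: s[s > 0].sum())

# assert_series_equal raises if values, dtype, index or name differ
pd.testing.assert_series_equal(via_nan, via_zero)
pd.testing.assert_series_equal(via_nan, via_lambda)
print(via_nan.tolist())
```

On frames with many groups the lambda variant tends to be slower, since it calls a Python function once per group, while the 'sum' variants use the built-in grouped sum.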