Pandas: Calculate diff column grouped by date and additional column

I have a pandas DataFrame with 3 columns:

import pandas as pd

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12+00:00','2021-10-31 20:48:26+00:00',
                            '2021-09-01 20:48:12+00:00','2021-09-30 20:48:26+00:00',
                            '2021-10-01 12:23:17+00:00','2021-10-31 12:23:17+00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})

It looks like this:

|product__sku | date                      |  qty |
|1            | 2021-10-01 20:48:12+00:00 |  100 |
|1            | 2021-10-31 20:48:26+00:00 |  84  |
|1            | 2021-09-01 20:48:12+00:00 |  5   |
|1            | 2021-09-30 20:48:26+00:00 |  10  |
|2            | 2021-10-01 12:23:17+00:00 |  15  |
|2            | 2021-10-31 12:23:17+00:00 |  48  |

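I need to group by two columns: date (by month) and product__sku. Within each group, the 'qty' column should be reduced to a difference (diff) using the formula: qty at max_date - qty at min_date.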

The result I expect to see:

|product__sku | date                      | diff |
|1            | 2021-09-30 20:48:12+00:00 |  5   |
|1            | 2021-10-31 20:48:12+00:00 |  -16 |
|2            | 2021-10-31 20:48:26+00:00 |  33  |

I tried using pd.Grouper:

        dg = df.groupby([ pd.Grouper('product__sku'), pd.Grouper(key='date', freq='1M')])['qty'].diff().fillna(0)

But I got a different result:

0     0.0
1   -16.0
2     0.0
Name: qty, dtype: float64
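A likely reason for the mismatch is that GroupBy.diff is a row-wise transform: it returns one value per input row (the difference from the previous row in the same group), not one value per group. The sketch below illustrates this on the sample data. It groups on a "YYYY-MM" string label rather than pd.Grouper, which is an assumption made only to keep the example independent of frequency aliases.

```python
import pandas as pd

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
                            '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
                            '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})
df['date'] = pd.to_datetime(df['date'])

# Year-month label per row (illustrative grouping key, not from the question).
month = df['date'].dt.strftime('%Y-%m')

# diff() subtracts the previous row of the same group from each row,
# so the first row of every group has no predecessor and becomes NaN.
row_diff = df.groupby(['product__sku', month])['qty'].diff()

# The result is aligned with the original rows, not collapsed to one row per group.
print(len(row_diff) == len(df))
print(row_diff)
```

Because the result keeps the original row index, an aggregation (one row per group) is needed to get a single diff per product and month.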

Use GroupBy.agg with first and last on the sorted DataFrame to get the qty values at the minimal and maximal dates, then subtract them, using DataFrame.pop to remove the first and last columns:
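A minimal sketch of that basic variant, under stated assumptions: it passes pd.offsets.MonthEnd() to pd.Grouper instead of a string alias (a substitution made here so the example does not depend on which alias spelling the installed pandas accepts), and it keeps the month-end bin label as the date column.

```python
import pandas as pd

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
                            '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
                            '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})
df['date'] = pd.to_datetime(df['date'])

# Sort by date so 'first' and 'last' correspond to the min and max date per group.
dg = (df.sort_values(['product__sku', 'date'])
        .groupby(['product__sku',
                  pd.Grouper(key='date', freq=pd.offsets.MonthEnd())])['qty']
        .agg(['first', 'last']))

# pop() returns each helper column and removes it, leaving only 'diff'.
dg['diff'] = dg.pop('last').sub(dg.pop('first'))
dg = dg.reset_index()
print(dg)
```

Here the date column holds the monthly bin label produced by the grouper, not an actual timestamp from the data.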

If you need the last date of each group, also use named aggregation for the date column:

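df['date'] = pd.to_datetime(df['date'])

# Sort so that 'first'/'last' within each group follow date order.
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end).
dg = (df.sort_values(['product__sku','date'])
        .groupby(['product__sku', pd.Grouper(key='date', freq='1M')])
        .agg(first=('qty','first'), last=('qty','last'), date=('date', 'last'))
        .reset_index(level=-1, drop=True)   # drop the month-end bin label
        .reset_index()
        )
# diff = qty at the latest date - qty at the earliest date
dg['diff'] = dg.pop('last').sub(dg.pop('first'))
print(dg)
   product__sku                      date  diff
0             1 2021-09-30 20:48:26+00:00     5
1             1 2021-10-31 20:48:26+00:00   -16
2             2 2021-10-31 12:23:17+00:00    33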

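First, group by product__sku and month. Then define a custom function that finds the qty difference between the maximum and minimum dates in each group, and apply it to each group: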

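def func(x):
    # Order the group's dates; the index keeps the original row labels.
    dates = x['date'].sort_values()
    # qty at the latest date minus qty at the earliest date
    diff = x.loc[dates.index[-1], 'qty'] - x.loc[dates.index[0], 'qty']
    # Keep only the row with the latest date; copy to avoid SettingWithCopyWarning.
    x = x[x['date']==dates.iloc[-1]].copy()
    x['diff'] = diff
    return x[['product__sku','date','diff']]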

df['date'] = pd.to_datetime(df['date'])
df = df.assign(month=df['date'].dt.month).groupby(['product__sku','month']).apply(func).reset_index(drop=True)

Output:

   product__sku                      date  diff
0             1 2021-09-30 20:48:26+00:00     5
1             1 2021-10-31 20:48:26+00:00   -16
2             2 2021-10-31 12:23:17+00:00    33
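One design point worth noting: grouping on dt.month alone puts the same calendar month of different years into one group. The sketch below is an alternative, not the answerer's method. It keys on a "YYYY-MM" label (the month column name is reused, but its content is an assumption of this example) and replaces groupby.apply with sort_values plus drop_duplicates.

```python
import pandas as pd

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
                            '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
                            '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})
df['date'] = pd.to_datetime(df['date'])

# Group key: product plus a year-month label, so months of different years stay apart.
keys = ['product__sku', 'month']
s = (df.assign(month=df['date'].dt.strftime('%Y-%m'))
       .sort_values(keys + ['date']))

# After sorting, the first/last row of each key pair holds the min/max date.
first = s.drop_duplicates(keys, keep='first').set_index(keys)
last = s.drop_duplicates(keys, keep='last').set_index(keys)

# Subtraction aligns on the (product__sku, month) index.
out = (last.assign(diff=last['qty'] - first['qty'])
           .reset_index()[['product__sku', 'date', 'diff']])
print(out)
```

On the sample data, which covers a single year, this yields the same diff values as the output above.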