Pandas: Calculate diff column grouped by date and additional column
I have a Pandas DataFrame with 3 columns:
import pandas as pd

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
                            '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
                            '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})
It looks like this:
|product__sku | date | qty |
|1 | 2021-10-01 20:48:12+00:00 | 100 |
|1 | 2021-10-31 20:48:26+00:00 | 84 |
|1 | 2021-09-01 20:48:12+00:00 | 5 |
|1 | 2021-09-30 20:48:26+00:00 | 10 |
|2 | 2021-10-01 12:23:17+00:00 | 15 |
|2 | 2021-10-31 12:23:17+00:00 | 48 |
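I need to group by two columns: date (by month) and product__sku. Within each group, the 'qty' column should be reduced to a difference (diff) using the formula: qty at max_date - qty at min_date.
The result I would like to see: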
|product__sku | date | diff |
|1 | 2021-09-30 20:48:12+00:00 | 5 |
|1 | 2021-10-31 20:48:12+00:00 | -16 |
|2 | 2021-10-31 20:48:26+00:00 | 33 |
I tried using pd.Grouper:
dg = df.groupby([ pd.Grouper('product__sku'), pd.Grouper(key='date', freq='1M')])['qty'].diff().fillna(0)
But I got a different result:
|0 0.0
| 1 -16.0
| 2 0.0
Name: qty, dtype: float64
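As a side note on why this attempt returns row-level values: GroupBy.diff subtracts each row from the previous row inside its group and returns one value per input row, with NaN for the first row of every group. It does not collapse a group into a single number. A minimal sketch on a reduced frame (the names `demo`, `row_diff` and `per_group` are illustrative and not part of the question):

```python
import pandas as pd

# Reduced, hypothetical frame: two rows per product, already in date order
demo = pd.DataFrame({'product__sku': [1, 1, 2, 2],
                     'qty': [100, 84, 15, 48]})

# Row-wise difference within each group: same length as the input
row_diff = demo.groupby('product__sku')['qty'].diff()
print(row_diff)

# One value per group: last qty minus first qty
per_group = demo.groupby('product__sku')['qty'].agg(lambda s: s.iloc[-1] - s.iloc[0])
print(per_group)
```

An aggregation, rather than diff, is therefore what produces one row per group.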
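Use GroupBy.agg with first and last on the DataFrame sorted by date, so that you get the qty values for the minimal and maximal dates of each group, then subtract them. DataFrame.pop returns the helper columns first and last and removes them from the result at the same time.
If you also need a date for each group, add a named aggregation for the date column as well (the code here takes the first date of each group; use ('date', 'last') to get the last one):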
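df['date'] = pd.to_datetime(df['date'])

# Sort first so that 'first'/'last' map to the earliest/latest date in each group
# (pandas 2.2+ deprecates the 'M' alias in favour of 'ME')
dg = (df.sort_values(['product__sku', 'date'])
        .groupby(['product__sku', pd.Grouper(key='date', freq='1M')])
        .agg(first=('qty', 'first'), last=('qty', 'last'), date=('date', 'first'))
        .reset_index(level=-1, drop=True)  # drop the month-end level created by Grouper
        .reset_index()
      )
# pop() returns each helper column and removes it from dg
dg['diff'] = dg.pop('last').sub(dg.pop('first'))
print(dg)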
product__sku date diff
0 1 2021-09-01 20:48:12+00:00 5
1 1 2021-10-01 20:48:12+00:00 -16
2 2 2021-10-01 12:23:17+00:00 33
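The question's expected output shows the last date of each month rather than the first. A minimal sketch of that variant follows, assuming a year-month string column (called `month` here, a hypothetical helper) as the grouping key instead of pd.Grouper, and ('date', 'last') as the named aggregation; `dg_last` is likewise an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'product__sku': [1, 1, 1, 1, 2, 2],
    'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
             '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
             '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
    'qty': [100, 84, 5, 10, 15, 48],
})
df['date'] = pd.to_datetime(df['date'])

dg_last = (
    df.assign(month=df['date'].dt.strftime('%Y-%m'))  # assumed year-month key
      .sort_values(['product__sku', 'date'])          # so first/last follow date order
      .groupby(['product__sku', 'month'])
      .agg(first=('qty', 'first'), last=('qty', 'last'), date=('date', 'last'))
      .reset_index()
      .drop(columns='month')
)
# last qty minus first qty; pop() removes the helper columns
dg_last['diff'] = dg_last.pop('last') - dg_last.pop('first')
print(dg_last)
```

Grouping on a year-month string keeps the original timestamps untouched, so the date column holds the actual last timestamp of each group rather than a calendar month end.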
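First, group by product__sku and month. Then define a custom function that finds the difference in qty between the maximum and minimum dates in each group, and apply it to each group: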
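def func(x):
    # Sort the group's dates; the first/last index labels point at the earliest/latest rows
    dates = x['date'].sort_values()
    diff = x.loc[dates.index[-1], 'qty'] - x.loc[dates.index[0], 'qty']
    # Keep only the row(s) with the latest date; copy() avoids a SettingWithCopyWarning
    x = x[x['date'] == dates.iloc[-1]].copy()
    x['diff'] = diff
    return x[['product__sku', 'date', 'diff']]

df['date'] = pd.to_datetime(df['date'])
df = (df.assign(month=df['date'].dt.month)
        .groupby(['product__sku', 'month'])
        .apply(func)
        .reset_index(drop=True))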
Output:
product__sku date diff
0 1 2021-09-30 20:48:26+00:00 5
1 1 2021-10-31 20:48:26+00:00 -16
2 2 2021-10-31 12:23:17+00:00 33
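Note that dt.month alone holds only the month number, so October 2021 and October 2022 would land in the same group if the data spanned several years. A possible apply-free sketch, assuming a year-month string key (the `month` helper column and the names `tmp`, `diff` and `out` are illustrative, not from the answer), uses GroupBy.transform with 'first' and 'last' on date-sorted data and GroupBy.tail(1) to keep the latest row of each group:

```python
import pandas as pd

df = pd.DataFrame({
    'product__sku': [1, 1, 1, 1, 2, 2],
    'date': ['2021-10-01 20:48:12+00:00', '2021-10-31 20:48:26+00:00',
             '2021-09-01 20:48:12+00:00', '2021-09-30 20:48:26+00:00',
             '2021-10-01 12:23:17+00:00', '2021-10-31 12:23:17+00:00'],
    'qty': [100, 84, 5, 10, 15, 48],
})
df['date'] = pd.to_datetime(df['date'])

keys = ['product__sku', 'month']
tmp = (df.assign(month=df['date'].dt.strftime('%Y-%m'))  # assumed year-month key
         .sort_values(['product__sku', 'date']))

# Broadcast last-minus-first qty to every row of its group
qty = tmp.groupby(keys)['qty']
diff = qty.transform('last') - qty.transform('first')

# Keep only the latest row of each group
out = (tmp.assign(diff=diff)
          .groupby(keys)
          .tail(1)[['product__sku', 'date', 'diff']]
          .reset_index(drop=True))
print(out)
```

Because the frame is sorted by date before grouping, 'first' and 'last' correspond to the minimum and maximum dates, and tail(1) returns the row carrying the maximum date.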