Pandas groupby with bin counts for timeseries
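On a sample DataFrame
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(6, 2), columns=list('ab'))
# six hourly timestamps; an identical start and end would give six duplicates
dti = pd.date_range(start='2019-02-12 00:00', end='2019-02-12 05:00', periods=6)
data.set_index(dti, inplace=True)
which yields: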
                            a         b
2019-02-12 00:00:00  0.909822  0.548713
2019-02-12 01:00:00  0.295730  0.452881
2019-02-12 02:00:00  0.889976  0.042893
2019-02-12 03:00:00  0.466465  0.971178
2019-02-12 04:00:00  0.532618  0.769210
2019-02-12 05:00:00  0.947362  0.021689
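Now, how can I combine grouping and binning across two columns?
Say I have bins = [0, 0.2, 0.4, 0.6, 0.8, 1]. How can I bin data on column a and get the mean (or max, min, sum, etc.) of column b for each bin, per day, per week, and per month?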
Use cut together with DatetimeIndex.day, DatetimeIndex.week (removed in pandas 2.0 in favor of DatetimeIndex.isocalendar().week), or DatetimeIndex.month, and aggregate with min, max, mean, or sum:
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# label each bin by its two edges, e.g. '0.4-0.6'
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
s = pd.cut(data['a'], bins=bins, labels=labels)
# observed=True keeps only the bins that actually occur in the data
df = data.groupby([data.index.day.rename('day'), s], observed=True)['b'].min().reset_index()
#df = data.groupby([data.index.isocalendar().week, s], observed=True)['b'].min().reset_index()
#df = data.groupby([data.index.month.rename('month'), s], observed=True)['b'].min().reset_index()
print(df)
   day        a         b
0   12  0.4-0.6  0.267070
1   12  0.6-0.8  0.637877
2   12  0.8-1.0  0.299172
(The numbers come from a different random draw than the sample printed in the question.)
Several functions can also be passed at once via agg:
df2 = (data.groupby([data.index.day.rename('day'), s], observed=True)['b']
           .agg(['min', 'max', 'sum', 'mean'])
           .reset_index())
print(df2)
   day        a       min       max       sum      mean
0   12  0.4-0.6  0.267070  0.267070  0.267070  0.267070
1   12  0.6-0.8  0.637877  0.903206  1.541084  0.770542
2   12  0.8-1.0  0.299172  0.405750  1.098002  0.366001
Or use describe for the full set of summary statistics:
df3 = (data.groupby([data.index.day.rename('day'), s], observed=True)['b']
           .describe()
           .reset_index())
print(df3)
   day        a  count      mean       std       min       25%       50%       75%       max
0   12  0.4-0.6    1.0  0.267070       NaN  0.267070  0.267070  0.267070  0.267070  0.267070
1   12  0.6-0.8    2.0  0.770542  0.187616  0.637877  0.704210  0.770542  0.836874  0.903206
2   12  0.8-1.0    3.0  0.366001  0.058221  0.299172  0.346126  0.393081  0.399415  0.405750
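As a reproducible variant of the approach above, here is a minimal sketch that hard-codes the six rows printed in the question instead of drawing random numbers. It swaps the DatetimeIndex.day key for pd.Grouper(freq='D'), which buckets by calendar date rather than day-of-month number; that substitution, the index name ts, and the column name a_bin are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd

# Values copied from the sample frame printed in the question.
idx = pd.date_range("2019-02-12 00:00", "2019-02-12 05:00", periods=6, name="ts")
data = pd.DataFrame(
    {
        "a": [0.909822, 0.295730, 0.889976, 0.466465, 0.532618, 0.947362],
        "b": [0.548713, 0.452881, 0.042893, 0.971178, 0.769210, 0.021689],
    },
    index=idx,
)

bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

# Assign each value of column a to its bin; the result is a categorical Series.
a_bin = pd.cut(data["a"], bins=bins, labels=labels).rename("a_bin")

# pd.Grouper buckets the DatetimeIndex by calendar day; freq="W" would give weeks.
# observed=True drops bins that contain no rows.
result = (
    data.groupby([pd.Grouper(freq="D"), a_bin], observed=True)["b"]
    .agg(["count", "mean"])
    .reset_index()
)
print(result)
```

On this sample all six rows fall on the same day, so the result has one row per non-empty bin: 0.2-0.4, 0.4-0.6 and 0.8-1.0, holding 1, 2 and 3 rows respectively.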