Pandas time series resampling
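I have a list of voyages, each with a start date, an end date, and the earnings for that voyage. I want to calculate monthly earnings, but I'm not sure how to do it with Pandas: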
'2016-02-28 07:30:00', '2016-04-30 00:00:00', '600000'
'2016-05-18 10:30:00', '2016-07-12 02:19:00', '700000'
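The way I would do this by hand is to count the days of the voyage that fall in each month and multiply that count by earnings / total voyage length.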
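The manual method described above can be sketched directly, without counting hours or days, by measuring how much of the voyage's elapsed time falls in each calendar month. This is only an illustration: the helper name `monthly_share` and the `(year, month)` keys are assumptions made here, not part of the question.

```python
import pandas as pd

def monthly_share(start, end, earnings):
    """Split earnings across calendar months in proportion to elapsed time."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    total = end - start
    shares = {}
    cur = start
    while cur < end:
        # First instant of the calendar month following `cur`
        next_month = (cur.to_period('M') + 1).to_timestamp()
        seg_end = min(next_month, end)
        # Timedelta / Timedelta gives a plain float fraction
        shares[(cur.year, cur.month)] = earnings * ((seg_end - cur) / total)
        cur = seg_end
    return pd.Series(shares)

voyage = monthly_share('2016-02-28 07:30:00', '2016-04-30 00:00:00', 600000)
print(voyage)
```

Keying on year and month keeps a voyage that spans a year boundary from merging, say, January of two different years.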
You need to count the hours in each row's date range that fall in each month. So use DataFrame.apply with a custom function that builds a date_range, then groupby by month and aggregate with size.
print (df)
start end price
0 2016-02-28 07:30:00 2016-04-30 00:00:00 600000
1 2016-05-18 10:30:00 2016-07-12 02:19:00 700000
print (df.dtypes)
start datetime64[ns]
end datetime64[ns]
price int64
dtype: object
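def f(x):
    # one timestamp per hour of the voyage ('H' is spelled 'h' in pandas 2.2+)
    rng = pd.date_range(x.start, x.end, freq='H')
    # count how many of those hourly timestamps fall in each month
    return rng.to_series().groupby([rng.month]).size()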
df1 = df.apply(f, axis=1)
print (df1)
2 3 4 5 6 7
0 41.0 744.0 696.0 NaN NaN NaN
1 NaN NaN NaN 326.0 720.0 266.0
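For anyone who wants to run the step above from scratch, here is a self-contained sketch. How `df` was built is not shown in the thread, so the construction below is an assumption; the function name `hours_per_month` is also introduced here for readability, and the grouping key is taken from `.dt.month` rather than the original list form.

```python
import pandas as pd

# Assumed construction of the frame printed above
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

def hours_per_month(row):
    # One timestamp per hour from start up to (at most) end
    s = pd.date_range(row['start'], row['end'], freq='h').to_series()
    # Count the hourly timestamps that land in each month number
    return s.groupby(s.dt.month).size()

# Rows are voyages, columns are month numbers; NaN marks months not touched
df1 = df.apply(hours_per_month, axis=1)
print(df1)
```

Note that grouping by month number alone would merge the same month from different years, which does not matter for this sample data.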
Then divide the price column by the sum of all hours in each row to get price_per_hour:
price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0 405.131668
1 533.536585
dtype: float64
Finally, multiply (mul) the hours in each month by price_per_hour:
print (df1.mul(price_per_hour, axis=0))
2 3 4 5 6 \
0 16610.398379 301417.960837 281971.640783 NaN NaN
1 NaN NaN NaN 173932.926829 384146.341463
7
0 NaN
1 141920.731707
# check: each row sums back to the original price
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0 600000.0
1 700000.0
dtype: float64
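The tables above give one row per voyage. Since the question asks for monthly earnings, a possible last step is to sum down the columns. The sketch below strings the three steps together; the frame construction, the names `units_per_month`, `per_voyage` and `monthly_total`, and the use of monthly periods as column labels (so that years stay distinct) are all assumptions added for illustration.

```python
import pandas as pd

# Assumed construction of the sample frame
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

def units_per_month(row, freq='h'):
    s = pd.date_range(row['start'], row['end'], freq=freq).to_series()
    # Group by year-month period instead of bare month number
    return s.groupby(s.dt.to_period('M')).size()

units = df.apply(units_per_month, axis=1)      # hours per voyage and month
unit_price = df['price'] / units.sum(axis=1)   # price per hour for each voyage
per_voyage = units.mul(unit_price, axis=0)     # earnings per voyage and month
monthly_total = per_voyage.sum()               # earnings per month, all voyages
print(monthly_total)
```

With the two sample voyages no month is shared, so each monthly total simply equals the single non-NaN value in its column.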
You can also allocate the price by days instead: change freq='H' to freq='D'. I think this is less accurate, though:
def f(x):
    rng = pd.date_range(x.start, x.end, freq='D')
    return rng.to_series().groupby([rng.month]).size()
df1 = df.apply(f, axis=1)
print (df1)
2 3 4 5 6 7
0 2.0 31.0 29.0 NaN NaN NaN
1 NaN NaN NaN 14.0 30.0 11.0
price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0 9677.419355
1 12727.272727
dtype: float64
print (df1.mul(price_per_hour, axis=0))
2 3 4 5 6 7
0 19354.83871 300000.0 280645.16129 NaN NaN NaN
1 NaN NaN NaN 178181.818182 381818.181818 140000.0
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0 600000.0
1 700000.0
dtype: float64
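To see roughly how much accuracy is lost with daily counts, the February share of the first voyage can be compared under the three schemes. The hourly and daily counts (41 of 1481, 2 of 62) are taken from the tables above; the "exact" figure uses elapsed time and is an added point of comparison, not something computed in the answer.

```python
import pandas as pd

start = pd.Timestamp('2016-02-28 07:30:00')
end = pd.Timestamp('2016-04-30 00:00:00')
price = 600000

# Exact share: elapsed time before 1 March over total elapsed time
feb_end = pd.Timestamp('2016-03-01')
exact = price * ((feb_end - start) / (end - start))

hourly = price * 41 / 1481   # counts from the freq='H' table
daily = price * 2 / 62       # counts from the freq='D' table

print(exact, hourly, daily)
```

The hourly figure lands much closer to the elapsed-time share than the daily one, because a partial first day is counted as a whole day when freq='D'.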
Another solution reshapes the data with melt, then uses groupby and resample. It also needs a groupby by month that aggregates with size:
df['count'] = df.index
df1 = pd.melt(df, id_vars=['price', 'count'], value_name='dates')
print (df1)
price count variable dates
0 600000 0 start 2016-02-28 07:30:00
1 700000 1 start 2016-05-18 10:30:00
2 600000 0 end 2016-04-30 00:00:00
3 700000 1 end 2016-07-12 02:19:00
df2 = df1.set_index('dates').groupby('count').resample('D').size()
print (df2)
count dates
0 2016-02-28 1
2016-02-29 0
2016-03-01 0
2016-03-02 0
2016-03-03 0
2016-03-04 0
2016-03-05 0
2016-03-06 0
2016-03-07 0
2016-03-08 0
2016-03-09 0
2016-03-10 0
2016-03-11 0
2016-03-12 0
...
...
print (df2.index.get_level_values('dates').month)
[2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7
7 7 7 7 7 7 7 7]
df3 = df2.groupby([df2.index.get_level_values('count'),
                   df2.index.get_level_values('dates').month]).size().unstack()
print (df3)
2 3 4 5 6 7
count
0 2.0 31.0 30.0 NaN NaN NaN
1 NaN NaN NaN 14.0 30.0 12.0
price_per_hour = df.price / df3.sum(axis=1)
print (price_per_hour)
0 9523.809524
1 12500.000000
dtype: float64
print (df3.mul(price_per_hour, axis=0))
2 3 4 5 6 \
count
0 19047.619048 295238.095238 285714.285714 NaN NaN
1 NaN NaN NaN 175000.0 375000.0
7
count
0 NaN
1 150000.0
print (df3.mul(price_per_hour, axis=0).sum(axis=1))
count
0 600000.0
1 700000.0
dtype: float64
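The day counts here (30 for April, 12 for July) differ from the freq='D' version (29 and 11) because resample('D') produces one bin for every calendar day from the start date through the end date, so the final partial day is included. A minimal sketch that reproduces the same counts without melt, assuming the frame is built as below and using `normalize()` to drop the time of day (the function name is made up for this example):

```python
import pandas as pd

# Assumed construction of the sample frame
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

def calendar_days_per_month(row):
    # Every calendar day touched by the voyage, first and last day included
    days = pd.date_range(row['start'].normalize(),
                         row['end'].normalize(), freq='D').to_series()
    return days.groupby(days.dt.month).size()

df3 = df.apply(calendar_days_per_month, axis=1)
print(df3)
```

Which of the two day-counting conventions is appropriate depends on whether a partly used last day should be billed as a full day.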