Python/Pandas - Concatenating 2 DataFrames with different PeriodIndex frequencies
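I would like to concatenate 2 DataFrames that have different PeriodIndex frequencies, and sort the result on the second index level, Position.
For example, I have the following 2 DataFrames.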
import pandas as pd
pr1h = pd.period_range(start='2020-01-01 08:00', end='2020-01-01 11:00', freq='1h')
pr2h = pd.period_range(start='2020-01-01 08:00', end='2020-01-01 11:00', freq='2h')
n_array_1h = [2, 2, 2, 2]
n_array_2h = [0, 1, 0, 1]
index_labels_1h = [pr1h, n_array_1h]
index_labels_2h = [[pr2h[0],pr2h[0],pr2h[1],pr2h[1]], n_array_2h]
values_1h = [[1], [2], [3], [4]]
values_2h = [[10], [20], [30], [40]]
df1h = pd.DataFrame(values_1h, index=index_labels_1h, columns=['Data'])
df1h.index.names=['Period','Position']
df2h = pd.DataFrame(values_2h, index=index_labels_2h, columns=['Data'])
df2h.index.names=['Period','Position']
df1h
                           Data
Period           Position
2020-01-01 08:00 2            1
2020-01-01 09:00 2            2
2020-01-01 10:00 2            3
2020-01-01 11:00 2            4
df2h
                           Data
Period           Position
2020-01-01 08:00 0           10
                 1           20
2020-01-01 10:00 0           30
                 1           40
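I would like to obtain df1h_new, in which:
- the PeriodIndex of df1h is kept,
- for each period of df1h, the data of the df2h block whose period.start_time is immediately lower than or equal to the current period.start_time in df1h is kept,
- the data of df1h is obviously kept.
So the result would be: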
df1h_new
                           Data
Period           Position
2020-01-01 08:00 0           10   # |---> data coming from df2h, block with
                 1           20   # |     start_time <= df1h.index[0].start_time
                 2            1   # ----> data from df1h.index[0]
2020-01-01 09:00 0           10   # |---> data coming from df2h, block with
                 1           20   # |     start_time <= df1h.index[1].start_time
                 2            2   # ----> data from df1h.index[1]
2020-01-01 10:00 0           30   # and so on...
                 1           40
                 2            3
2020-01-01 11:00 0           30
                 1           40
                 2            4
Please, what is the recommended way to achieve this?
Thanks for your help and support! Best,
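One idea is to use concat with Series.unstack, and bring both to the same frequency with DataFrame.asfreq (the unstacked result is a DataFrame). By default asfreq labels each 2-hour period by its last hour (how='E'), so the missing values are then back filled, and the result is reshaped back to a MultiIndex with stack: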
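# 'h' is the same hourly alias the question uses; the uppercase 'H' is deprecated in recent pandas
df = (pd.concat([df1h['Data'].unstack(),
                 df2h['Data'].unstack().asfreq('h')], axis=1)
        .bfill()        # fill the hours that have no df2h row from the next labelled hour
        .stack()        # back to a (Period, Position) MultiIndex
        .sort_index()
        .to_frame('Data'))
print(df)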
                           Data
Period           Position
2020-01-01 08:00 0         10.0
                 1         20.0
                 2          1.0
2020-01-01 09:00 0         10.0
                 1         20.0
                 2          2.0
2020-01-01 10:00 0         30.0
                 1         40.0
                 2          3.0
2020-01-01 11:00 0         30.0
                 1         40.0
                 2          4.0