Adding missing dates to dataframe using reindex replaces data
I am trying to add missing dates to my dataframe.
I have looked at this post and reindex2.
When I try to reindex my dataframe:
print(df)
df = df.reindex(dates, fill_value=0)
print(df)
I get the following output:
_updated_at Name hour day date time data1 data2
06/06/2016 13:27 game_name 13 6 06/06/2016 evening 0 0
07/06/2016 10:33 game_name 10 7 07/06/2016 morning 145.2788 122.7361
18/10/2016 14:34 game_name 14 18 18/10/2016 evening 0 0
19/10/2016 17:12 game_name 17 19 19/10/2016 evening 0 0
24/10/2016 11:05 game_name 11 24 24/10/2016 morning 313.5954 364.4107
24/10/2016 12:02 game_name 12 24 24/10/2016 evening 0 0
25/10/2016 08:50 game_name 8 25 25/10/2016 morning 362.4682 431.5803
25/10/2016 13:00 game_name 13 25 25/10/2016 evening 0 0
_updated_at Name hour day date time data1 data2
24/10/2016 0 0 0 0 0 0 0
25/10/2016 0 0 0 0 0 0 0
26/10/2016 0 0 0 0 0 0 0
27/10/2016 0 0 0 0 0 0 0
28/10/2016 0 0 0 0 0 0 0
29/10/2016 0 0 0 0 0 0 0
30/10/2016 0 0 0 0 0 0 0
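I expected new rows to be added for the missing dates, with 0 in every column, rather than all existing rows being replaced with 0.
Edit:
The overall goal is to compute the difference between the morning and evening values for each day.
Edit 2:
Current output: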
print (df.reindex(mux, fill_value=0).groupby(level=0)['data1'].diff(-1).dropna())
dtypes: float64(2)None
2016-06-06 morning 0.00000
2016-06-07 morning 440.99582
2016-06-08 morning 0.00000
2016-06-09 morning 0.00000
2016-06-10 morning 0.00000
print (df.reindex(mux, fill_value=0).groupby(level=0)['data2'].diff(-1).dropna())
Length: 142, dtype: float64
2016-06-06 morning -220.5481
2016-06-07 morning 0.0000
2016-06-08 morning 0.0000
2016-06-09 morning 0.0000
2016-06-10 morning 0.0000
2016-06-11 morning 0.0000
I was expecting to see the evening values.
You can reindex by a MultiIndex.from_product built from dates and the values of the time column:
df.date = pd.to_datetime(df.date)
dates = pd.date_range(start=df.date.min(), end=df.date.max())
print (dates)
DatetimeIndex(['2016-06-06', '2016-06-07', '2016-06-08', '2016-06-09',
'2016-06-10', '2016-06-11', '2016-06-12', '2016-06-13',
'2016-06-14', '2016-06-15',
...
'2016-10-16', '2016-10-17', '2016-10-18', '2016-10-19',
'2016-10-20', '2016-10-21', '2016-10-22', '2016-10-23',
'2016-10-24', '2016-10-25'],
dtype='datetime64[ns]', length=142, freq='D')
mux = pd.MultiIndex.from_product([dates,['morning','evening']])
#print (mux)
df.set_index(['date','time'], inplace=True)
print (df.reindex(mux, fill_value=0))
_updated_at Name hour day data1 data2
2016-06-06 morning 0 0 0 0 0.0000 0.0000
evening 06/06/2016 13:27 game_name 13 6 0.0000 0.0000
2016-06-07 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-08 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-09 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-10 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-11 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-12 morning 0 0 0 0 0.0000 0.0000
evening 0 0 0 0 0.0000 0.0000
2016-06-13 morning 0 0 0 0 0.0000 0.0000
...
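For a self-contained check of this step, here is a minimal sketch on a four-row sample modelled on the question's data. The sample frame and the `dayfirst=True` argument are assumptions, not part of the original answer: the date strings in the question look day-first (`18/10/2016`), so the sketch makes the parse explicit. `reindex` aligns on index labels, which is why `date` and `time` have to become the index before reindexing.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data (day-first date strings).
df = pd.DataFrame({
    "date": ["24/10/2016", "24/10/2016", "25/10/2016", "27/10/2016"],
    "time": ["morning", "evening", "morning", "evening"],
    "data1": [313.5954, 0.0, 362.4682, 10.0],
    "data2": [364.4107, 0.0, 431.5803, 20.0],
})

# Assumption: the strings are day/month/year, so parse them as such.
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Every calendar day in the observed range, crossed with both times of day.
dates = pd.date_range(start=df["date"].min(), end=df["date"].max())
mux = pd.MultiIndex.from_product([dates, ["morning", "evening"]],
                                 names=["date", "time"])

# Existing (date, time) pairs keep their values;
# pairs missing from the data are added and filled with 0.
full = df.set_index(["date", "time"]).reindex(mux, fill_value=0)
print(full)
```

Reindexing against plain `dates` while the frame still has its default integer index matches no labels at all, which would explain the all-zero result in the question.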
Finally, you can groupby the first level of the MultiIndex (dates) and use DataFrameGroupBy.diff. Each date gets one row with NaN, which can be removed with dropna:
print (df.reindex(mux, fill_value=0).groupby(level=0)[['data1','data2']].diff(-1).dropna())
data1 data2
2016-06-06 morning 0.0000 0.0000
2016-06-07 morning 0.0000 0.0000
2016-06-08 morning 0.0000 0.0000
2016-06-09 morning 0.0000 0.0000
2016-06-10 morning 0.0000 0.0000
2016-06-11 morning 0.0000 0.0000
2016-06-12 morning 0.0000 0.0000
2016-06-13 morning 0.0000 0.0000
2016-06-14 morning 0.0000 0.0000
2016-06-15 morning 0.0000 0.0000
2016-06-16 morning 0.0000 0.0000
2016-06-17 morning 0.0000 0.0000
2016-06-18 morning 0.0000 0.0000
2016-06-19 morning 0.0000 0.0000
2016-06-20 morning 0.0000 0.0000
2016-06-21 morning 0.0000 0.0000
...
...
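To see what `diff(-1)` does inside each group, here is a small sketch with hypothetical values. With a period of -1, each row has the following row subtracted from it. When the index is ordered morning then evening, the morning row holds morning minus evening, and the evening row has nothing after it and becomes NaN.

```python
import pandas as pd

dates = pd.to_datetime(["2016-10-24", "2016-10-25"])
mux = pd.MultiIndex.from_product([dates, ["morning", "evening"]])

# Hypothetical values: two days, each with a morning row then an evening row.
s = pd.DataFrame({"data1": [313.5954, 0.0, 362.4682, 100.0]}, index=mux)

# Within each date: current row minus the next row.
# The last row of each group has no next row, so it is NaN.
d = s.groupby(level=0)["data1"].diff(-1)
print(d)

# Dropping the NaN rows leaves one difference per date.
result = d.dropna()
print(result)
```

The surviving rows are labelled `morning` because that is the row where the difference is stored; the value itself is morning minus evening.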
You can also select rows by position with iloc and subtract:
print (df.reindex(mux, fill_value=0)
         .groupby(level=0)[['data1','data2']]
         .apply(lambda x: x.iloc[0] - x.iloc[1]))
data1 data2
2016-06-06 0.0000 0.0000
2016-06-07 0.0000 0.0000
2016-06-08 0.0000 0.0000
2016-06-09 0.0000 0.0000
2016-06-10 0.0000 0.0000
2016-06-11 0.0000 0.0000
2016-06-12 0.0000 0.0000
2016-06-13 0.0000 0.0000
2016-06-14 0.0000 0.0000
2016-06-15 0.0000 0.0000
2016-06-16 0.0000 0.0000
2016-06-17 0.0000 0.0000
2016-06-18 0.0000 0.0000
2016-06-19 0.0000 0.0000
2016-06-20 0.0000 0.0000
2016-06-21 0.0000 0.0000
2016-06-22 0.0000 0.0000
2016-06-23 0.0000 0.0000
2016-06-24 0.0000 0.0000
2016-06-25 0.0000 0.0000
2016-06-26 0.0000 0.0000
2016-06-27 0.0000 0.0000
2016-06-28 0.0000 0.0000
2016-06-29 0.0000 0.0000
2016-06-30 0.0000 0.0000
2016-07-01 0.0000 0.0000
2016-07-02 0.0000 0.0000
2016-07-03 0.0000 0.0000
2016-07-04 0.0000 0.0000
2016-07-05 0.0000 0.0000
... ...
2016-09-26 0.0000 0.0000
2016-09-27 0.0000 0.0000
2016-09-28 0.0000 0.0000
2016-09-29 0.0000 0.0000
2016-09-30 0.0000 0.0000
2016-10-01 0.0000 0.0000
2016-10-02 0.0000 0.0000
2016-10-03 0.0000 0.0000
2016-10-04 0.0000 0.0000
2016-10-05 0.0000 0.0000
2016-10-06 0.0000 0.0000
2016-10-07 0.0000 0.0000
2016-10-08 0.0000 0.0000
2016-10-09 0.0000 0.0000
2016-10-10 0.0000 0.0000
2016-10-11 0.0000 0.0000
2016-10-12 0.0000 0.0000
2016-10-13 0.0000 0.0000
2016-10-14 0.0000 0.0000
2016-10-15 0.0000 0.0000
2016-10-16 0.0000 0.0000
2016-10-17 0.0000 0.0000
2016-10-18 0.0000 0.0000
2016-10-19 0.0000 0.0000
2016-10-20 0.0000 0.0000
2016-10-21 0.0000 0.0000
2016-10-22 0.0000 0.0000
2016-10-23 0.0000 0.0000
2016-10-24 313.5954 364.4107
2016-10-25 362.4682 431.5803
[142 rows x 2 columns]
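A different technique from the two shown above, offered only as a sketch: pivot the `time` level into columns with `unstack` and subtract the evening columns from the morning ones. The sample values and the level names `date` and `time` are assumptions made for illustration.

```python
import pandas as pd

dates = pd.to_datetime(["2016-10-24", "2016-10-25"])
mux = pd.MultiIndex.from_product([dates, ["morning", "evening"]],
                                 names=["date", "time"])

# Hypothetical, already-reindexed frame: one morning and one evening row per date.
full = pd.DataFrame({"data1": [313.5954, 0.0, 362.4682, 100.0],
                     "data2": [364.4107, 0.0, 431.5803, 31.5803]}, index=mux)

# Move the time level into the columns: one row per date,
# with column levels (data1/data2, morning/evening).
wide = full[["data1", "data2"]].unstack("time")

# Pick the morning and evening slices and subtract them column by column.
morning = wide.xs("morning", axis=1, level="time")
evening = wide.xs("evening", axis=1, level="time")
diff = morning - evening
print(diff)
```

This does not depend on the row order within each date, since the subtraction is done by label rather than by position.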