Fill datetimeindex gap by NaN
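I have two datetime-indexed DataFrames. One (df1) is missing some of the datetimes, while the other (df2) is complete (regular timestamps with no gaps in the series) and filled with NaN.
I am trying to match the values from df1 to the index of df2, filling with NaN wherever a timestamp does not exist in df1.
Example: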
In [51]: df1
Out [51]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-03-01 00:00:00 6
2015-03-01 01:00:00 37
2015-03-01 02:00:00 56
2015-03-01 03:00:00 12
2015-03-01 04:00:00 41
2015-03-01 05:00:00 31
... ...
2018-12-25 23:00:00 41
<34843 rows × 1 columns>
In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
df2['value'] = np.nan
df2
Out [52]: value
2015-01-01 14:00:00 NaN
2015-01-01 15:00:00 NaN
2015-01-01 16:00:00 NaN
2015-01-01 17:00:00 NaN
2015-01-01 18:00:00 NaN
2015-01-01 19:00:00 NaN
2015-01-01 20:00:00 NaN
2015-01-01 21:00:00 NaN
2015-01-01 22:00:00 NaN
2015-01-01 23:00:00 NaN
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 NaN
<34906 rows × 1 columns>
Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index): the gaps where there should be no data are filled with values instead of NaN.
In [53]: Result = df2.combine_first(df1)
Result
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 35
2015-01-02 01:00:00 53
2015-01-02 02:00:00 28
2015-01-02 03:00:00 48
2015-01-02 04:00:00 42
2015-01-02 05:00:00 51
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
This is what I was hoping to get:
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
Can someone shed some light on why this is happening, and how to control the way these values are filled?
IIUC you need to resample df1, because it has an irregular frequency and you need a regular one:
print(df1.index.freq)
None
print(Result.index.freq)
<60 * Minutes>
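To make the idea concrete, here is a minimal sketch on a tiny made-up frame (the name small and its values are assumptions for illustration, not the asker's data). In current pandas, resample returns a Resampler object, so a follow-up call such as asfreq() is needed to get a DataFrame back; asfreq() does not aggregate anything and simply leaves the missing hours as NaN.

```python
import pandas as pd

# Hypothetical stand-in for df1: hourly data with a two-hour gap after 23:00.
idx = pd.to_datetime([
    "2015-01-01 22:00", "2015-01-01 23:00",
    "2015-01-02 02:00", "2015-01-02 03:00",
])
small = pd.DataFrame({"value": [39, 17, 28, 48]}, index=idx)

# The original index has no regular frequency attached.
print(small.index.freq)

# Put the data on a regular hourly grid; hours with no data become NaN.
regular = small.resample("60min").asfreq()
print(regular)
```

Note that the value column is upcast to float64 here, because an integer column cannot hold NaN.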
EDIT1
You can use the function asfreq instead of resample - doc, resample vs asfreq.
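As a rough illustration of the asfreq route, the sketch below uses a small made-up frame (small and its values are assumptions, not the asker's data). It also builds the full hourly index by hand and calls reindex, which is what the question attempted, to show that both approaches should leave the missing hours as NaN.

```python
import pandas as pd

# Hypothetical gapped hourly data: 00:00 and 01:00 on 2015-01-02 are missing.
idx = pd.to_datetime(["2015-01-01 22:00", "2015-01-01 23:00", "2015-01-02 02:00"])
small = pd.DataFrame({"value": [39, 17, 28]}, index=idx)

# asfreq conforms the frame to a regular hourly index without aggregating.
via_asfreq = small.asfreq("60min")

# The same thing spelled out: build the complete index, then reindex onto it.
full_index = pd.date_range(small.index.min(), small.index.max(), freq="60min")
via_reindex = small.reindex(full_index)

print(via_asfreq)
print(via_asfreq.equals(via_reindex))
```

If reindex returns numbers in the gaps on the real data, it is worth checking whether those timestamps are in fact present in df1.index.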
EDIT2
At first I thought resample did not work, because after resampling Result looked the same as df1. But df1.info() and Result.info() give different results: 34857 entries vs 34920 entries.
So I looked for the rows with NaN values, and it returns 63 rows.
So I think resample works well.
import pandas as pd
df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print(df1.head())
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print(df1.info())
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value 34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None
#older pandas averaged each bin implicitly; pandas >= 0.18 needs an explicit call such as .mean()
Result = df1.resample('60min').mean()
print(Result.head())
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print(Result.info())
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value 34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None
#find rows with NaN
resultnan = Result[Result.isnull().any(axis=1)]
#temporarily display 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print(resultnan)
# value
#Date/Time
#2015-01-13 19:00:00 NaN
#2015-01-13 20:00:00 NaN
#2015-01-13 21:00:00 NaN
#2015-01-13 22:00:00 NaN
#2015-01-13 23:00:00 NaN
#2015-01-14 00:00:00 NaN
#2015-01-14 01:00:00 NaN
#2015-01-14 02:00:00 NaN
#2015-01-14 03:00:00 NaN
#2015-01-14 04:00:00 NaN
#2015-01-14 05:00:00 NaN
#2015-01-14 06:00:00 NaN
#2015-01-14 07:00:00 NaN
#2015-01-14 08:00:00 NaN
#2015-01-14 09:00:00 NaN
#2015-02-01 00:00:00 NaN
#2015-02-01 01:00:00 NaN
#2015-02-01 02:00:00 NaN
#2015-02-01 03:00:00 NaN
#2015-02-01 04:00:00 NaN
#2015-02-01 05:00:00 NaN
#2015-02-01 06:00:00 NaN
#2015-02-01 07:00:00 NaN
#2015-02-01 08:00:00 NaN
#2015-02-01 09:00:00 NaN
#2015-02-01 10:00:00 NaN
#2015-02-01 11:00:00 NaN
#2015-02-01 12:00:00 NaN
#2015-02-01 13:00:00 NaN
#2015-02-01 14:00:00 NaN
#2015-02-01 15:00:00 NaN
#2015-02-01 16:00:00 NaN
#2015-02-01 17:00:00 NaN
#2015-02-01 18:00:00 NaN
#2015-02-01 19:00:00 NaN
#2015-02-01 20:00:00 NaN
#2015-02-01 21:00:00 NaN
#2015-02-01 22:00:00 NaN
#2015-02-01 23:00:00 NaN
#2015-11-01 00:00:00 NaN
#2015-11-01 01:00:00 NaN
#2015-11-01 02:00:00 NaN
#2015-11-01 03:00:00 NaN
#2015-11-01 04:00:00 NaN
#2015-11-01 05:00:00 NaN
#2015-11-01 06:00:00 NaN
#2015-11-01 07:00:00 NaN
#2015-11-01 08:00:00 NaN
#2015-11-01 09:00:00 NaN
#2015-11-01 10:00:00 NaN
#2015-11-01 11:00:00 NaN
#2015-11-01 12:00:00 NaN
#2015-11-01 13:00:00 NaN
#2015-11-01 14:00:00 NaN
#2015-11-01 15:00:00 NaN
#2015-11-01 16:00:00 NaN
#2015-11-01 17:00:00 NaN
#2015-11-01 18:00:00 NaN
#2015-11-01 19:00:00 NaN
#2015-11-01 20:00:00 NaN
#2015-11-01 21:00:00 NaN
#2015-11-01 22:00:00 NaN
#2015-11-01 23:00:00 NaN
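The missing timestamps can also be listed directly, without resampling first. This is only a sketch on made-up data (small and expected are hypothetical names): it builds the complete hourly range and takes the set difference against the existing index.

```python
import pandas as pd

# Hypothetical gapped hourly data: 00:00 and 01:00 on 2015-01-02 are missing.
idx = pd.to_datetime(["2015-01-01 22:00", "2015-01-01 23:00", "2015-01-02 02:00"])
small = pd.DataFrame({"value": [39, 17, 28]}, index=idx)

# Every hour that should exist between the first and last timestamp.
expected = pd.date_range(small.index.min(), small.index.max(), freq="60min")

# Timestamps in the complete range that are absent from the data.
missing = expected.difference(small.index)
print(len(missing))
print(missing)
```

On the real data, len(missing) should match the number of NaN rows found above, as long as df1 itself contains no NaN values.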