Missing data range Pandas dataframe comparison Python
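How can I write code that outputs the difference between dates and data? data is missing some data points, so it has jumps relative to dates, which covers the whole time range at 1-minute intervals. For example, after 2015-10-08 13:53:00 there are 6 data points missing, so that gap is printed as '2015-10-08 13:54:00', '2015-10-08 14:00:00'. I want to output the ranges that are missing from data and record them in a 2-D array, as shown under Expected output. How can I write a function that produces the expected output?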
import pandas as pd
import datetime
dates = pd.date_range("2015-10-08 13:40:00", "2015-10-08 14:12:00", freq="1min")
data = pd.to_datetime(['2015-10-08 13:41:00',
'2015-10-08 13:42:00', '2015-10-08 13:43:00',
'2015-10-08 13:44:00', '2015-10-08 13:45:00',
'2015-10-08 13:46:00', '2015-10-08 13:47:00',
'2015-10-08 13:48:00', '2015-10-08 13:49:00',
'2015-10-08 13:50:00', '2015-10-08 13:51:00',
'2015-10-08 13:52:00', '2015-10-08 13:53:00',
'2015-10-08 13:54:00', '2015-10-08 14:01:00',
'2015-10-08 14:02:00', '2015-10-08 14:03:00',
'2015-10-08 14:04:00', '2015-10-08 14:05:00',
'2015-10-08 14:06:00', '2015-10-08 14:07:00',
'2015-10-08 14:10:00', '2015-10-08 14:11:00',
'2015-10-08 14:12:00'])
Expected output:
[['2015-10-08 13:40:00'],
['2015-10-08 13:54:00', '2015-10-08 14:00:00'],
['2015-10-08 14:08:00', '2015-10-08 14:09:00']]
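dates and data are both DatetimeIndex objects. You can use pd.Index.difference to find the timestamps that are in dates but not in data, then group consecutive timestamps into ranges: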
In [55]: s = pd.Series(dates.difference(data))
...: s # sort if needed
Out[55]:
0 2015-10-08 13:40:00
1 2015-10-08 13:55:00
2 2015-10-08 13:56:00
3 2015-10-08 13:57:00
4 2015-10-08 13:58:00
5 2015-10-08 13:59:00
6 2015-10-08 14:00:00
7 2015-10-08 14:08:00
8 2015-10-08 14:09:00
dtype: datetime64[ns]
In [56]: groups_diff_ne_1min = s.diff().fillna(pd.Timedelta(seconds=60)) != pd.Timedelta(seconds=60)
...: groups_diff_ne_1min
Out[56]:
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
In [57]: groups = groups_diff_ne_1min.cumsum()
...: groups
Out[57]:
0 0
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
In [58]: s.groupby(groups).agg(['first', 'last'])
Out[58]:
                first                last
0 2015-10-08 13:40:00 2015-10-08 13:40:00
1 2015-10-08 13:55:00 2015-10-08 14:00:00
2 2015-10-08 14:08:00 2015-10-08 14:09:00
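The steps above can be wrapped into a single function that returns the nested list of strings the question asks for. This is only a minimal sketch: the name missing_ranges, its freq parameter, and the string format are assumptions made for illustration and are not part of the original answer. It rebuilds data from three contiguous ranges, which should be equivalent to the list in the question. Note that the second range starts at 13:55:00 rather than 13:54:00, because data contains 13:54:00.

```python
import pandas as pd


def missing_ranges(expected, actual, freq="1min"):
    """Return runs of consecutive timestamps in `expected` that are absent from `actual`.

    Hypothetical helper: each run is [first, last] as strings, or [only] for a
    run that contains a single timestamp.
    """
    step = pd.Timedelta(freq)
    missing = pd.Series(expected.difference(actual))
    if missing.empty:
        return []

    # A new run starts wherever the gap to the previous missing timestamp
    # differs from one step; the first row is treated as a continuation.
    starts_new_run = missing.diff().fillna(step) != step
    run_id = starts_new_run.cumsum()

    fmt = "%Y-%m-%d %H:%M:%S"
    result = []
    for _, run in missing.groupby(run_id):
        first, last = run.iloc[0], run.iloc[-1]
        if first == last:
            result.append([first.strftime(fmt)])
        else:
            result.append([first.strftime(fmt), last.strftime(fmt)])
    return result


dates = pd.date_range("2015-10-08 13:40:00", "2015-10-08 14:12:00", freq="1min")
# Same timestamps as `data` in the question, built from its three contiguous blocks.
data = (
    pd.date_range("2015-10-08 13:41:00", "2015-10-08 13:54:00", freq="1min")
    .append(pd.date_range("2015-10-08 14:01:00", "2015-10-08 14:07:00", freq="1min"))
    .append(pd.date_range("2015-10-08 14:10:00", "2015-10-08 14:12:00", freq="1min"))
)

result = missing_ranges(dates, data)
print(result)
```

Passing freq as a parameter instead of hard-coding pd.Timedelta(seconds=60) is a design choice of this sketch, so the same function could be reused for other regular spacings.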