Resampling with origin='end_day'
I don't understand what origin='end_day' does.
The docs give an example:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int32
>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00 3
2000-10-01 23:55:00 15
2000-10-02 00:12:00 45
2000-10-02 00:29:00 45
Freq: 17T, dtype: int32
The docs describe origin='end_day' as follows:
'end_day': origin is the ceiling midnight of the last day
As I understand it, the line
ts.resample('17min', origin='end_day').sum()
should be equivalent to
ts.resample('17min', origin=ts.index.max().ceil('1d')).sum()
However, passing the timestamp ts.index.max().ceil('1d') produces a different result:
>>> ts.resample('17min', origin=ts.index.max().ceil('1d')).sum()
2000-10-01 23:21:00 3
2000-10-01 23:38:00 15
2000-10-01 23:55:00 27
2000-10-02 00:12:00 63
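I am looking for an explanation of this difference, ideally one that goes further than the general description of the 'end_day' parameter in the docs.
Edit: I am using pandas 1.3.5.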
The actual equivalent of origin='end_day' is:
>>> ts.resample('17min', origin=ts.index.max().ceil('D'),
...             closed='right', label='right').sum()
2000-10-01 23:38:00 3
2000-10-01 23:55:00 15
2000-10-02 00:12:00 45
2000-10-02 00:29:00 45
Freq: 17T, dtype: int64
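To make the difference concrete, here is a small sketch that runs the three calls side by side on the series from the question. The variable names (by_keyword, by_timestamp, by_timestamp_right) are only illustrative. Both origins put the bin edges on the same 17-minute grid counted back from 2000-10-03 00:00:00; what changes is which side of each bin is closed and which edge is used as the label.

```python
import numpy as np
import pandas as pd

# Series from the question: 9 points, 7 minutes apart.
rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Ceiling midnight of the last day in the index.
origin = ts.index.max().ceil("D")

# 1) The keyword: closed and label default to 'right'.
by_keyword = ts.resample("17min", origin="end_day").sum()

# 2) The same origin as a Timestamp: closed and label default to 'left'.
by_timestamp = ts.resample("17min", origin=origin).sum()

# 3) The Timestamp plus explicit right-closed, right-labelled bins.
by_timestamp_right = ts.resample(
    "17min", origin=origin, closed="right", label="right"
).sum()

print(by_keyword.equals(by_timestamp_right))  # same values and labels
print(by_keyword.equals(by_timestamp))        # differs from the plain Timestamp call
```

So the Timestamp alone only fixes where the grid sits; the keyword additionally flips the defaults for closed and label.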
Update 1:
- What if I use origin='end_day' but also explicitly pass in closed and label not being 'right'? Where's the behavior defined for this?
From the source code of resample:
# The backward resample sets ``closed`` to ``'right'`` by default
# since the last value should be considered as the edge point for
# the last bin. When origin in "end" or "end_day", the value for a
# specific ``Timestamp`` index stands for the resample result from
# the current ``Timestamp`` minus ``freq`` to the current
# ``Timestamp`` with a right close.
if origin in ["end", "end_day"]:
    if closed is None:
        closed = "right"
    if label is None:
        label = "right"
else:
    if closed is None:
        closed = "left"
    if label is None:
        label = "left"
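In other words, 'right' is only a fallback: the branch above is taken only when closed or label is None. A minimal sketch of what that implies, reusing the series from the question (the names default_end_day and explicit_left are illustrative, not from pandas):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# No closed/label given: the 'end_day' fallback picks 'right' for both.
default_end_day = ts.resample("17min", origin="end_day").sum()

# Explicit arguments are not None, so they are used as passed.
explicit_left = ts.resample(
    "17min", origin="end_day", closed="left", label="left"
).sum()

print(default_end_day)
print(explicit_left)
```

With closed='left' and label='left' the bins sit on the same grid but are left-closed and left-labelled, so the result should match what the question got when passing the Timestamp origin with default arguments.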
Update 2a:
- Consider df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7)). Now df.resample(rule='7d', origin='end_day') crashes with a ValueError.
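If you do not set the closed parameter explicitly, resample sets it to 'right' because origin='end_day' (see above). So origin is now '2021-04-29', and the first bin value, '2021-04-22', is excluded. You are hitting the "Values falls before first bin" case. Setting closed='left' explicitly avoids it: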
df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7))
df.resample(rule='7d', origin='end_day', closed='left') # <- HERE
Update 2b:
- If '2021-04-22' is the first bin, which timestamp falls outside of it? '2021-04-22 01:00:00' is later, right?
df = pd.DataFrame(index=pd.date_range(start='2021-04-21 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(8))
print(df)
# Output:
0
2021-04-21 01:00:00 0
2021-04-22 01:00:00 1
2021-04-23 01:00:00 2
2021-04-24 01:00:00 3
2021-04-25 01:00:00 4
2021-04-26 01:00:00 5
2021-04-27 01:00:00 6
2021-04-28 01:00:00 7
With this example, I think it should be clearer:
# closed='right' (default)
>>> df.resample(rule='7d', origin='end_day').sum()
0
2021-04-22 1 # ('2021-04-15', '2021-04-22']
2021-04-29 27 # ('2021-04-22', '2021-04-29']
# closed='left'
>>> df.resample(rule='7d', origin='end_day', closed='left').sum()
0
2021-04-22 0 # ['2021-04-15', '2021-04-22')
2021-04-29 28 # ['2021-04-22', '2021-04-29')
The bin_edges values are:
# closed='right' (default)
>>> bin_edges
[1618531199999999999 1619135999999999999 1619740799999999999]
# after conversion
DatetimeIndex(['2021-04-15 23:59:59.999999999',
'2021-04-22 23:59:59.999999999',
'2021-04-29 23:59:59.999999999'],
dtype='datetime64[ns]', freq=None)
# closed='left'
>>> bin_edges
[1618444800000000000 1619049600000000000 1619654400000000000]
# after conversion
DatetimeIndex(['2021-04-15',
'2021-04-22',
'2021-04-29'],
dtype='datetime64[ns]', freq=None)
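The "after conversion" step can be reproduced with pd.to_datetime, and the converted edges answer the question from Update 2b. The sketch below copies the closed='right' integers printed above and checks them against the seven-row index from Update 2a. It rests on one assumption: that for the seven-row frame the first edge is the same shifted '2021-04-22 23:59:59.999999999' value. The edges themselves are taken from the pandas 1.3.5 output above, not recomputed, so other pandas versions may build them differently.

```python
import numpy as np
import pandas as pd

# closed='right' bin edges from above, in nanoseconds since the epoch.
right_edges_ns = np.array(
    [1618531199999999999, 1619135999999999999, 1619740799999999999],
    dtype="int64",
)
right_edges = pd.to_datetime(right_edges_ns)

# Index of the seven-row frame from Update 2a.
idx = pd.date_range("2021-04-22 01:00:00", "2021-04-28 01:00", freq="D")

# Assumption: the first edge for this frame is the shifted '2021-04-22' edge.
first_edge = right_edges[1]

# Timestamps that sit strictly before the first edge.
before_first_bin = idx[idx < first_edge]

print(right_edges)
print(before_first_bin)
```

Because the right-closed edge is moved to the last nanosecond of '2021-04-22', the row at '2021-04-22 01:00:00' lies before it, which is consistent with the "Values falls before first bin" error.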