Convert `pandas` frequency string to `DateOffset`
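I have a timezone-aware pandas `DatetimeIndex`, and I want to advance it by one time step, where the step is given by its `.freq` attribute. However, doing so does not respect the timezone information: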
import pandas as pd
i = pd.date_range('2020-03-28', freq='D', periods=3, tz='Europe/Amsterdam')
# DatetimeIndex(['2020-03-28 00:00:00+01:00', '2020-03-29 00:00:00+01:00',
# '2020-03-30 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
i + i.freq
# Not what I want; second timestamp is advanced by 24h instead of 23h and is no longer at midnight:
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 01:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
What does work is using `pd.DateOffset`:
i + pd.DateOffset(days=1)
# What I want; all timestamps at midnight (I just need to re-set the .freq attribute):
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 00:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
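The difference between the two results above comes down to absolute versus wall-clock arithmetic. As a minimal sketch on a single timestamp (the variable names `absolute` and `calendar` are only illustrative), adding a `pd.Timedelta` moves forward by exactly 24 elapsed hours, while `pd.DateOffset(days=1)` moves to the same wall-clock time on the next calendar day:

```python
import pandas as pd

# Midnight on the day the clocks go forward in Amsterdam (a 23-hour day).
ts = pd.Timestamp("2020-03-29", tz="Europe/Amsterdam")

# Absolute arithmetic: exactly 24 elapsed hours later.
absolute = ts + pd.Timedelta(days=1)

# Calendar arithmetic: same wall-clock time on the next day.
calendar = ts + pd.DateOffset(days=1)

print(absolute)  # 2020-03-30 01:00:00+02:00
print(calendar)  # 2020-03-30 00:00:00+02:00
```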
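However, since I do not know the frequency of the index beforehand, I would like to use the value of `i.freq` to obtain the correct `DateOffset`. Is there a way to do this? (Other than using a long `if... elif... elif...` block.)
Other solutions are of course also welcome.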
This is the only other related question I found, but I cannot use it here:
i + pd.tseries.frequencies.to_offset(i.freq)
# Not what I want:
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 01:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
(In fact, the latter simply returns `i.freq`.)
Many thanks.
Edit (1)
As suggested in the comments, using `.shift(1)` works in some cases, including the one I mentioned above...
i.shift(1)
# What I want; all timestamps at midnight:
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 00:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
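...but not in all of them. In fact, moving the start date of my original index forward by one day causes a timestamp to be dropped, and the remaining ones are wrong: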
i2 = pd.date_range('2020-03-29', freq='D', periods=3, tz='Europe/Amsterdam')
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 00:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
i2.shift(1)
# Not what I want: timestamps not at midnight, and one got dropped!
# DatetimeIndex(['2020-03-30 01:00:00+02:00', '2020-03-31 01:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
Edit (2)
As suggested by @MrFruppes in an answer, using the `.nanos` attribute of `i.freq` as input to `pd.DateOffset` works in some cases...
i + pd.DateOffset(nanoseconds=i.freq.nanos)
# What I want; all timestamps at midnight (I just need to re-set the .freq attribute):
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 00:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
...but it breaks when we try to advance to the start of the next month:
i3 = pd.date_range('2020-03-01', freq='MS', periods=3, tz='Europe/Amsterdam')
# DatetimeIndex(['2020-03-01 00:00:00+01:00', '2020-04-01 00:00:00+02:00',
# '2020-05-01 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='MS')
i3 + pd.DateOffset(nanoseconds=i3.freq.nanos)
Traceback (most recent call last):
  File "<ipython-input-58-f3a32c654a6e>", line 1, in <module>
    i3 + pd.DateOffset(nanoseconds=i3.freq.nanos)
  File "pandas\_libs\tslibs\offsets.pyx", line 690, in pandas._libs.tslibs.offsets.BaseOffset.nanos.__get__
ValueError: <MonthBegin> is a non-fixed frequency
If you have a fixed frequency, you can use the `nanos` attribute of the frequency. For example:
import pandas as pd
i = pd.date_range('2020-03-29', freq='D', periods=3, tz='Europe/Amsterdam')
# DatetimeIndex(['2020-03-29 00:00:00+01:00', '2020-03-30 00:00:00+02:00',
# '2020-03-31 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq='D')
i + pd.DateOffset(nanoseconds=i.freq.nanos)
# DatetimeIndex(['2020-03-30 00:00:00+02:00', '2020-03-31 00:00:00+02:00',
# '2020-04-01 00:00:00+02:00'],
# dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
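Whether a frequency counts as "fixed" in this sense can be probed before relying on `.nanos`. The following is a minimal sketch, assuming that a non-fixed frequency raises `ValueError` on access (as the `MonthBegin` traceback in the question shows); the helper name `is_fixed_frequency` is hypothetical and not part of pandas:

```python
import pandas as pd


def is_fixed_frequency(freq) -> bool:
    """Return True if `freq.nanos` is defined, i.e. the step has a fixed length."""
    try:
        freq.nanos
    except ValueError:
        # Calendar-dependent offsets such as MonthBegin raise here.
        return False
    return True


print(is_fixed_frequency(pd.offsets.Hour()))        # True
print(is_fixed_frequency(pd.offsets.MonthBegin()))  # False
```

Note that a fixed length alone does not guarantee wall-clock times are preserved across DST transitions, which is the problem the question describes for daily data.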
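The `pd.DateOffset` approach is not universally applicable either. This is what I currently have; it passes all my unit tests, but I am open to improvements: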
if i.tz is None:
    raise AttributeError("Index is missing timezone information.")

# Get right timestamp for each index value, based on the frequency.
# . This one breaks for 'MS':
#   i + pd.DateOffset(nanoseconds=i.freq.nanos)
# . This drops a value at some DST transitions:
#   i.shift(1)
# . This one gives wrong value at DST transitions:
#   i + i.freq
if i.freq == "15T":  # period length always the same
    ts_right = i + pd.Timedelta(hours=0.25)
elif i.freq == "H":  # period length always the same
    ts_right = i + pd.Timedelta(hours=1)
else:  # period length dependent on calendar
    if i.freq == "D":
        kwargs = {"days": 1}
    elif i.freq == "MS":
        kwargs = {"months": 1}
    elif i.freq == "QS":
        kwargs = {"months": 3}
    elif i.freq == "AS":
        kwargs = {"years": 1}
    else:
        raise ValueError(f"Invalid frequency: {i.freq}.")
    ts_right = i + pd.DateOffset(**kwargs)
(I have only implemented the `.freq` values that are relevant to my use case.)
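One possible way to shorten the chain of comparisons is a lookup table keyed on the offset class. The sketch below is only an illustration of that idea, not a pandas feature: the names `_CALENDAR_STEPS` and `advance_one_step` are hypothetical, and only the same handful of frequencies is covered. Calendar-like offsets are checked first so that daily data uses wall-clock arithmetic; any remaining fixed-length frequency falls back to `.nanos`.

```python
import pandas as pd

# Hypothetical mapping: offset class -> DateOffset keyword arguments for one step.
_CALENDAR_STEPS = {
    pd.offsets.Day: {"days": 1},
    pd.offsets.MonthBegin: {"months": 1},
    pd.offsets.QuarterBegin: {"months": 3},
    pd.offsets.YearBegin: {"years": 1},
}


def advance_one_step(i: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Advance a timezone-aware index by one period, keeping wall-clock times."""
    if i.tz is None:
        raise AttributeError("Index is missing timezone information.")
    freq = i.freq
    if freq is None:
        raise ValueError("Index has no frequency.")
    # Calendar-dependent periods: use wall-clock arithmetic via DateOffset.
    for offset_cls, kwargs in _CALENDAR_STEPS.items():
        if isinstance(freq, offset_cls):
            # Scale by freq.n so that multiples such as '2MS' are handled too.
            scaled = {unit: count * freq.n for unit, count in kwargs.items()}
            return i + pd.DateOffset(**scaled)
    # Fixed-length periods (e.g. hourly, 15 minutes): add the elapsed time.
    try:
        return i + pd.Timedelta(freq.nanos, unit="ns")
    except ValueError:
        raise ValueError(f"Invalid frequency: {freq}.") from None
```

As in the code above, the result has `freq=None`, so the `.freq` attribute still needs to be set again afterwards if it is required.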