How to use the Pandas Timestamp fold argument?
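When dealing with time zone conversions and the effects of daylight saving time, I am having trouble understanding how the fold argument of the Pandas Timestamp constructor is implemented. The documentation says: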
Due to daylight saving time, one wall clock time can occur twice when shifting from summer to winter time; fold describes whether the datetime-like corresponds to the first (0) or the second time (1) the wall clock hits the ambiguous time.
No surprises so far, but when I run the code below:
import pandas as pd
from datetime import datetime
pre_fold = pd.Timestamp(datetime(2022,10,30,1,30,0), tz="CET")
in_fold_fold0 = pd.Timestamp(datetime(2022,10,30,2,30,0), tz="CET")
in_fold_fold1 = pd.Timestamp(datetime(2022,10,30,2,30,0), tz="CET", fold=1)
post_fold = pd.Timestamp(datetime(2022,10,30,3,30,0), tz="CET")
print(f"fold0: {in_fold_fold0.fold}")
print(f"fold1: {in_fold_fold1.fold}")
print(f"Pre CET fold: {pre_fold} -> UTC {pre_fold.tz_convert(tz='UTC')}")
print(f"In CET fold, fold0: {in_fold_fold0} -> UTC {in_fold_fold0.tz_convert(tz='UTC')}")
print(f"In CET fold, fold1: {in_fold_fold1} -> UTC {in_fold_fold1.tz_convert(tz='UTC')}")
print(f"Post CET fold: {post_fold} -> UTC {post_fold.tz_convert(tz='UTC')}")
the output is not what I expected:
fold0: 0
fold1: 1
Pre CET fold: 2022-10-30 01:30:00+02:00 -> UTC 2022-10-29 23:30:00+00:00
In CET fold, fold0: 2022-10-30 02:30:00+01:00 -> UTC 2022-10-30 01:30:00+00:00
In CET fold, fold1: 2022-10-30 02:30:00+01:00 -> UTC 2022-10-30 01:30:00+00:00
Post CET fold: 2022-10-30 03:30:00+01:00 -> UTC 2022-10-30 02:30:00+00:00
The fourth line should be:
In CET fold, fold0: 2022-10-30 02:30:00+02:00 -> UTC 2022-10-30 00:30:00+00:00
What am I missing here?
PS: Using Python's datetime objects produces the expected output:
from datetime import datetime
from dateutil import tz
dt_pre_fold = datetime(2022,10,30,1,30,0, tzinfo=tz.gettz("CET"))
dt_in_fold_fold0 = datetime(2022,10,30,2,30,0, tzinfo=tz.gettz("CET"))
dt_in_fold_fold1 = datetime(2022,10,30,2,30,0, tzinfo=tz.gettz("CET"), fold=1)
dt_post_fold = datetime(2022,10,30,3,30,0, tzinfo=tz.gettz("CET"))
print(f"Pre CET fold: {dt_pre_fold} -> UTC {dt_pre_fold.astimezone(tz.gettz('UTC'))}")
print(f"In CET fold, fold0: {dt_in_fold_fold0} -> UTC {dt_in_fold_fold0.astimezone(tz.gettz('UTC'))}")
print(f"In CET fold, fold1: {dt_in_fold_fold1} -> UTC {dt_in_fold_fold1.astimezone(tz.gettz('UTC'))}")
print(f"Post CET fold: {dt_post_fold} -> UTC {dt_post_fold.astimezone(tz.gettz('UTC'))}")
Output:
Pre CET fold: 2022-10-30 01:30:00+02:00 -> UTC 2022-10-29 23:30:00+00:00
In CET fold, fold0: 2022-10-30 02:30:00+02:00 -> UTC 2022-10-30 00:30:00+00:00
In CET fold, fold1: 2022-10-30 02:30:00+01:00 -> UTC 2022-10-30 01:30:00+00:00
Post CET fold: 2022-10-30 03:30:00+01:00 -> UTC 2022-10-30 02:30:00+00:00
It seems the time zone information is not being specified correctly:
# using your code
x = pd.Timestamp(datetime(2022,10,30,2,30,0), fold = 0, tz="CET")
x.tz_convert('UTC')
# Timestamp('2022-10-30 01:30:00+0000', tz='UTC')
But if you pass a dateutil time zone instead:
from dateutil import tz
x = pd.Timestamp(datetime(2022,10,30,2,30,0), fold = 0, tz=tz.gettz("CET"))
x.tz_convert('UTC')
# Timestamp('2022-10-30 00:30:00+0000', tz='UTC')
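it returns the correct value.
This sidesteps the question somewhat, but instead of using 'fold' you can localize the timestamp to a time zone and use the ambiguous keyword to specify whether it should be the DST or the non-DST time. From the docs: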
ambiguous [...]
bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)
So you can get what you want with:
import pandas as pd
f0 = pd.Timestamp("2022-10-30 02:30:00").tz_localize("Europe/Berlin", ambiguous=True)
f1 = pd.Timestamp("2022-10-30 02:30:00").tz_localize("Europe/Berlin", ambiguous=False)
print(f0.tz_convert('UTC'))
print(f1.tz_convert('UTC'))
# 2022-10-30 00:30:00+00:00 # was DST, UTC+2
# 2022-10-30 01:30:00+00:00 # was non-DST, UTC+1
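The docs excerpt above describes a bool-ndarray form of ambiguous. As a minimal sketch, assuming the same Europe/Berlin transition and a made-up four-element index (the variable names here are illustrative only), the array form could be applied to a whole DatetimeIndex like this:

```python
import numpy as np
import pandas as pd

# Hypothetical wall-clock readings around the 2022-10-30 transition;
# 02:30 appears twice because the clock is set back from 03:00 to 02:00.
naive = pd.DatetimeIndex(
    [
        "2022-10-30 01:30",
        "2022-10-30 02:30",
        "2022-10-30 02:30",
        "2022-10-30 03:30",
    ]
)

# One flag per element: True = DST, False = non-DST.
# The flag only matters for the two ambiguous 02:30 entries.
is_dst = np.array([False, True, False, False])

localized = naive.tz_localize("Europe/Berlin", ambiguous=is_dst)
utc = localized.tz_convert("UTC")

for local_ts, utc_ts in zip(localized, utc):
    print(f"{local_ts} -> UTC {utc_ts}")
```

The array must have the same length as the index, so the non-ambiguous entries still need a placeholder flag.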
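As for why 'fold' does not work as expected, I think it goes back to the fact that pandas still uses pytz internally for time zone handling. pytz does not support 'fold'; it uses the keyword 'is_dst' instead. You can find more information, for example, in this blog post by Paul Ganssle. There are also some hints deep in the pandas src. In contrast, dateutil's time zones do support 'fold', which is probably why @ZLi's solution works.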
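To illustrate the dateutil side of this, here is a small sketch using plain Python datetimes only (no pandas). It assumes Europe/Berlin as the zone and uses dateutil's tz.datetime_ambiguous helper to check whether a wall-clock time falls inside the fold before choosing a side:

```python
from datetime import datetime
from dateutil import tz

berlin = tz.gettz("Europe/Berlin")
naive = datetime(2022, 10, 30, 2, 30)

# True when the wall-clock time occurs twice in the given zone.
is_ambiguous = tz.datetime_ambiguous(naive, tz=berlin)

# fold=0 selects the first occurrence, fold=1 the second one.
first = naive.replace(tzinfo=berlin, fold=0)
second = naive.replace(tzinfo=berlin, fold=1)

first_utc = first.astimezone(tz.UTC)
second_utc = second.astimezone(tz.UTC)

print(f"ambiguous: {is_ambiguous}")
print(f"fold=0: {first} -> UTC {first_utc}")
print(f"fold=1: {second} -> UTC {second_utc}")
```

The two results should differ by exactly one hour, matching the datetime output shown in the question.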
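Side notes:
- Prefer actual IANA time zone names to avoid the ambiguity that abbreviations can cause
- Do not mix native Python datetime and pandas' datetime, to avoid some of the rough edges of native Python datetime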