Pandas tells me non-ambiguous time is ambiguous

I have the following test code:

import pandas as pd

dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})

With pandas 1.1.5 this runs successfully. With pandas 1.2.5 or 1.3.4, however, it fails with the following error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    'value': [3, 4, 5]})
  File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
    arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
  File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
    val, index, dtype=dtype, copy=False, raise_cast_failure=False
  File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
    data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
  File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
    subarr = cls._from_sequence([value] * length, dtype=dtype)
  File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
    return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
  File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
    ambiguous=ambiguous,
  File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
    data.view("i8"), tz, ambiguous=ambiguous
  File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument

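I know that November 7 is the day daylight saving time ends. But this data looks unambiguous to me, and it is fully localized. Why does pandas forget its timezone information, and why does it refuse to put it in a DataFrame? Is there some kind of workaround?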

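Update:

I remembered that I had actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week, when we started seeing actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505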

It is ambiguous because two different instants share this particular wall-clock time, one with DST in effect and one without:

# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
      .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)


# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
      .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
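
To make the two cases concrete, here is a minimal sketch of the same overlap seen from the UTC side. The two UTC instants are picked for illustration and are not from the original post; it only shows that both land on 01:00 New York wall-clock time with different offsets, so a naive 01:00 cannot be localized without the `ambiguous` argument.

```python
import pandas as pd

# Two distinct UTC instants, one hour apart (chosen for illustration).
first = pd.Timestamp('2021-11-07 05:00:00', tz='UTC').tz_convert('America/New_York')
second = pd.Timestamp('2021-11-07 06:00:00', tz='UTC').tz_convert('America/New_York')

# Dropping the timezone keeps the local wall-clock reading.
same_wall_clock = first.tz_localize(None) == second.tz_localize(None)

# The offsets differ: -04:00 while DST is in effect, -05:00 after it ends.
first_offset = first.utcoffset()
second_offset = second.utcoffset()

print(first, second, same_wall_clock)
```

Once the offset is thrown away, as appears to happen somewhere inside the DataFrame constructor in the affected versions, only the wall-clock value is left and pandas has to guess.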

Workaround

dt = pd.to_datetime('2021-11-07 01:00:00-0400')

df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})

df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
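
If you want to convince yourself that nothing is lost by converting after construction, a quick check along these lines may help. This is only a sketch: the variable names `offsets` and `tz_name` are mine, and the check assumes the column ends up as a timezone-aware datetime column.

```python
import datetime

import pandas as pd

# Build the frame with the fixed-offset timestamp, then convert the column.
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt, 'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')

# Every row should still be the same instant, now carrying the New York zone.
offsets = [ts.utcoffset() for ts in df['datetime']]
tz_name = str(df['datetime'].dt.tz)

print(tz_name, offsets[0] == datetime.timedelta(hours=-4))
```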

I accepted @Corralien's answer, and I also want to show the workaround I ultimately decided to go with:

# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# 
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)

The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, or pd.Index.
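
For anyone who wants to try this outside of my class, here is a rough standalone sketch. The helper bodies are my reconstruction from the description above (in particular, treating is_scalar() as simply "not is_array()" is an assumption), and `frame_from_dict` is a made-up name for illustration.

```python
import numpy as np
import pandas as pd

# Assumed set of "array-like" types, taken from the description above.
ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)


def is_array(x):
    """Return True if x is one of the container types listed above."""
    return isinstance(x, ARRAY_TYPES)


def is_scalar(x):
    """Assumption: anything that is not array-like counts as a scalar."""
    return not is_array(x)


def frame_from_dict(data):
    """Hypothetical wrapper: repeat a scalar 'datetime' by hand before
    building the DataFrame, so pandas does not broadcast it itself."""
    data = dict(data)  # shallow copy, leave the caller's dict untouched
    max_len = max((len(x) if is_array(x) else 1 for x in data.values()), default=0)
    if max_len > 0 and is_scalar(data['datetime']):
        data['datetime'] = [data['datetime']] * max_len
    return pd.DataFrame(data)


dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
df = frame_from_dict({'datetime': dt, 'value': [3, 4, 5]})
print(df)
```

The point of the list is that pandas then sees timezone-aware Timestamps element by element, rather than going through the scalar broadcast path shown in the traceback.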

It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.