TimeSeries conversion from CET / CEST to UTC

I have two time series files which are supposed to be in CET / CEST. In the bad one, the values were not written in the right way. For the good csv, see here:

#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

...everything works fine using the following:

        # parentheses are needed so the method chain can span several lines
        df['utc_time'] = (pd.to_datetime(df[local_time_column])
                            .dt.tz_localize('CET', ambiguous="infer")
                            .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))

When converting test_bad.csv to UTC, I get an AmbiguousTimeError, because 2 hours are missing in October.

# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017   # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224   # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...

Does anyone know an elegant way to still convert the time series file to UTC and fill the missing dates in the new index with NaN? Thanks for your help.

Elaborating on Mark Ransom's comment:

2017-10-29 02:00,7224 

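is ambiguous; it could be either 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That is why tz_localize with ambiguous="infer" refuses to infer anything: the file contains no repeated hour to infer from.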
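To make the ambiguity concrete, here is a minimal sketch using only the standard library. It assumes Python 3.9+ with zoneinfo and time zone data available; this is an illustration of my own and not part of the pytz code further down. The `fold` attribute selects which of the two occurrences of 02:00 is meant.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")
naive = datetime(2017, 10, 29, 2, 0)  # the ambiguous wall-clock time

# fold=0 selects the first occurrence (still CEST, UTC+2),
# fold=1 selects the second occurrence (already CET, UTC+1).
first = naive.replace(tzinfo=berlin, fold=0).astimezone(timezone.utc)
second = naive.replace(tzinfo=berlin, fold=1).astimezone(timezone.utc)

print(first.isoformat())   # 2017-10-29T00:00:00+00:00
print(second.isoformat())  # 2017-10-29T01:00:00+00:00
```

Both results are valid readings of the same line in the file, so something has to decide between them.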

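With the help of some native Python, you can work around this. Assuming you have just loaded the csv into df without parsing anything to datetime, you can continue with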

from datetime import datetime
import pandas as pd
import pytz

df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]

# so you can make a UTC index:
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)

# Now you can create a new, hourly index from that and re-index:
dti = pd.date_range(df.index[0], df.index[-1], freq='H')
df2 = df.reindex(dti)

# for comparison, the "re-created" local_time column:
df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values

This should give you something like

df2
                            value           local_time
2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
                          ...                  ...
2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00 # note: value attributed to the "second" 2 am (pytz default is_dst=False)
2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00

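As noted above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC (pytz's localize defaults to is_dst=False, i.e. standard time), but it could just as well belong to 2017-10-29 00:00:00 UTC. If you don't care, that's fine. If this is a problem, I think the best you can do is discard the value. You can do that by using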

df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')

instead of the native Python part in the code above.
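A minimal sketch of that variant, assuming the October rows of test_bad.csv are typed inline so that it runs without the file. The intermediate names (`local`, `utc`, `full`, `result`) are my own. Rows that end up as NaT are dropped before building the UTC index, so the ambiguous value is discarded rather than guessed.

```python
import pandas as pd

# Inline copy of the October rows from test_bad.csv (assumption: stands in for the file).
df = pd.DataFrame({
    "local_time": ["2017-10-29 01:00", "2017-10-29 02:00",
                   "2017-10-29 04:00", "2017-10-29 05:00"],
    "value": [7223, 7224, 7226, 7227],
})

# Ambiguous wall-clock times become NaT instead of raising an error.
local = pd.to_datetime(df["local_time"]).dt.tz_localize("Europe/Berlin", ambiguous="NaT")

# Drop the rows that could not be localized, then index by UTC.
utc = df.assign(utc_time=local.dt.tz_convert("UTC")).dropna(subset=["utc_time"])
utc = utc.set_index("utc_time")[["value"]]

# Re-index onto a gap-free hourly UTC index; missing hours become NaN.
full = pd.date_range(utc.index[0], utc.index[-1], freq=pd.Timedelta(hours=1))
result = utc.reindex(full)
print(result)
```

With these four rows the result spans 2017-10-28 23:00 UTC to 2017-10-29 04:00 UTC, and the three hours in between that have no trustworthy value are NaN.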

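Just to share the solution I ended up using for this workaround: it uses try/except to cope with the ambiguous-time error. It converts the time vector to UTC and, at the same time, fills in the missing values by reindexing. Suggestions for improvement are welcome.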

import numpy as np
import pandas as pd
import pytz

# Assumptions: the first column of df already holds tz-naive datetimes
# (e.g. read with parse_dates=[0]) and the data is hourly.
freq = pd.Timedelta(hours=1)

try:  # here everything is as expected: one hour is missing in Mar and one hour is repeated in Oct

    # Localize the tz-naive time column to the target time zone and convert it to UTC.
    df['time'] = df.iloc[:, 0].dt.tz_localize('CET', ambiguous='infer').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
    df = df.set_index(pd.to_datetime(df['time'], utc=True))

    # Create a complete time vector in UTC for later reindexing
    idx = pd.date_range(df.index.min(), df.index.max(), freq=freq, tz='UTC')

    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while simultaneously adding NaNs to the data columns
        print(f'These dates are not in the sequence:{idx.difference(df.index)}')
        df = df.reindex(idx).rename_axis('time')

except pytz.exceptions.AmbiguousTimeError:  # here pandas does not know how to handle the non-repeated time

    # create the localized datetime column with a list comprehension
    df['time'] = [pytz.timezone('Europe/Berlin').localize(t, is_dst=True) for t in df.iloc[:, 0]]

    # make a UTC index:
    df.set_index(df['time'].dt.tz_convert('UTC'), inplace=True)

    # create a new index of the desired frequency from that and re-index:
    idx = pd.date_range(df.index[0], df.index[-1], freq=freq, tz='UTC')

    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while simultaneously adding NaNs to the data columns
        print(f'These were the dates which were not in the sequence:{pd.Series(idx.difference(df.index))}')
        df = df.reindex(idx).rename_axis('time')
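One possible simplification, offered only as a sketch: the two branches above differ just in how the column is localized, so the shared reindexing can be written once. The helper name `to_utc_hourly` and its parameters are hypothetical, hourly data is assumed, and the fallback discards ambiguous rows (ambiguous='NaT') instead of forcing is_dst=True as the code above does. The exception is caught broadly so the sketch does not depend on which exception class the installed pandas raises.

```python
import pandas as pd

def to_utc_hourly(df, time_col="local_time", tz="Europe/Berlin"):
    """Index df by UTC and fill missing hours with NaN (hypothetical helper)."""
    naive = pd.to_datetime(df[time_col])
    try:
        # Works when the repeated autumn hour is present twice.
        local = naive.dt.tz_localize(tz, ambiguous="infer")
    except Exception:
        # Fallback: mark ambiguous times as NaT so they can be dropped.
        local = naive.dt.tz_localize(tz, ambiguous="NaT")

    out = df.assign(utc_time=local.dt.tz_convert("UTC")).dropna(subset=["utc_time"])
    out = out.set_index("utc_time").drop(columns=[time_col])

    # Complete hourly UTC index; timestamps absent from the data become NaN rows.
    full = pd.date_range(out.index.min(), out.index.max(), freq=pd.Timedelta(hours=1))
    missing = full.difference(out.index)
    return out.reindex(full).rename_axis("utc_time"), missing

# Usage with the October rows of test_bad.csv typed inline (assumption).
bad = pd.DataFrame({
    "local_time": ["2017-10-29 01:00", "2017-10-29 02:00",
                   "2017-10-29 04:00", "2017-10-29 05:00"],
    "value": [7223, 7224, 7226, 7227],
})
fixed, missing = to_utc_hourly(bad)
print(fixed)
print(f"Timestamps added as NaN: {list(missing)}")
```

Returning the missing timestamps alongside the frame keeps the reporting step from the original code without printing inside the helper.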