Pandas convert datetime with a separate time zone column
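I have a dataframe with a time zone column and a datetime column. I would like to convert these to UTC first so I can join them with other data, and then I will do some calculations to eventually convert from UTC to the viewer's local time zone.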
datetime time_zone
2016-09-19 01:29:13 America/Bogota
2016-09-19 02:16:04 America/New_York
2016-09-19 01:57:54 Africa/Cairo
def create_utc(df, column, time_format='%Y-%m-%d %H:%M:%S'):
    timezone = df['TZ']
    df[column + '_utc'] = df[column].dt.tz_localize(timezone).dt.tz_convert('UTC').dt.strftime(time_format)
    df[column + '_utc'].replace('NaT', np.nan, inplace=True)
    df[column + '_utc'] = pd.to_datetime(df[column + '_utc'])
    return df
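That was my flawed attempt. The error is that the truth value of a Series is ambiguous, which makes sense because the 'timezone' variable refers to a whole column. How do I refer to the value in the same row?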
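Edit: here are some results from the answers below on one day of data (394,000 rows and 22 unique time zones).
Edit2: I added a groupby example in case someone wants to see the results. It is the fastest by far.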
%%timeit
for tz in df['TZ'].unique():
    df.loc[df['TZ'] == tz, 'datetime_utc2'] = df.loc[df['TZ'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc2'] = df['datetime_utc2'].dt.tz_localize(None)
1 loops, best of 3: 1.27 s per loop
%%timeit
df['datetime_utc'] = [d['datetime'].tz_localize(d['TZ']).tz_convert('UTC') for i, d in df.iterrows()]
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)
1 loops, best of 3: 50.3 s per loop
%%timeit
df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('TZ')])
**1 loops, best of 3: 249 ms per loop**
Your problem is that tz_localize() can only take a scalar value, so we have to iterate over the DataFrame:
df['datetime_utc'] = [d['datetime'].tz_localize(d['time_zone']).tz_convert('UTC') for i,d in df.iterrows()]
The result is:
datetime time_zone datetime_utc
0 2016-09-19 01:29:13 America/Bogota 2016-09-19 06:29:13+00:00
1 2016-09-19 02:16:04 America/New_York 2016-09-19 06:16:04+00:00
2 2016-09-19 01:57:54 Africa/Cairo 2016-09-18 23:57:54+00:00
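For anyone who wants to run the row-wise idea end to end, here is a minimal self-contained sketch. It rebuilds the three sample rows from the question and pairs each timestamp with its own zone using zip() instead of iterrows(). That swap is my own variation, not part of the answer above; it avoids building a Series object per row, which should make it somewhat lighter, but it is still one Python-level call per row.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-09-19 01:29:13",
        "2016-09-19 02:16:04",
        "2016-09-19 01:57:54",
    ]),
    "time_zone": ["America/Bogota", "America/New_York", "Africa/Cairo"],
})

# Timestamp.tz_localize takes a single zone, so each value is paired
# with the zone from its own row and converted one at a time.
df["datetime_utc"] = [
    ts.tz_localize(tz).tz_convert("UTC")
    for ts, tz in zip(df["datetime"], df["time_zone"])
]

print(df)
```

This stays readable on small frames; with many rows and few distinct zones, a per-zone approach should scale much better.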
Another approach is to group by time zone and convert all matching rows at once:
df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('time_zone')])
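The groupby one-liner can be wrapped in a small helper if you need it in more than one place. This is only a sketch: the function name to_utc_by_group and its dt_col / tz_col parameters are made up for illustration, and it assumes the datetime column is tz-naive and every zone name is a valid IANA identifier.

```python
import pandas as pd

def to_utc_by_group(frame, dt_col="datetime", tz_col="time_zone"):
    """Return a tz-aware UTC Series aligned to frame.index.

    Hypothetical helper: localizes each zone's rows in one vectorized
    call, so the Python-level loop runs once per distinct zone.
    """
    parts = [
        grp[dt_col].dt.tz_localize(tz).dt.tz_convert("UTC")
        for tz, grp in frame.groupby(tz_col)
    ]
    # concat returns rows in group order; reindex restores the input order
    return pd.concat(parts).reindex(frame.index)

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-09-19 01:29:13",
        "2016-09-19 02:16:04",
        "2016-09-19 01:57:54",
        "2016-09-19 11:00:00",
        "2016-09-19 12:00:00",
        "2016-09-19 13:00:00",
    ]),
    "time_zone": ["America/Bogota", "America/New_York", "Africa/Cairo"] * 2,
})

df["datetime_utc"] = to_utc_by_group(df)
print(df)
```

Every group is already in UTC before the concat, so the pieces share one dtype and the combined column stays a proper tz-aware datetime column.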
Here is a vectorized approach (it will loop df.time_zone.nunique() times):
In [2]: t
Out[2]:
datetime time_zone
0 2016-09-19 01:29:13 America/Bogota
1 2016-09-19 02:16:04 America/New_York
2 2016-09-19 01:57:54 Africa/Cairo
3 2016-09-19 11:00:00 America/Bogota
4 2016-09-19 12:00:00 America/New_York
5 2016-09-19 13:00:00 Africa/Cairo
In [3]: for tz in t.time_zone.unique():
   ...:     mask = (t.time_zone == tz)
   ...:     t.loc[mask, 'datetime'] = \
   ...:         t.loc[mask, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
   ...:
In [4]: t
Out[4]:
datetime time_zone
0 2016-09-19 06:29:13 America/Bogota
1 2016-09-19 06:16:04 America/New_York
2 2016-09-18 23:57:54 Africa/Cairo
3 2016-09-19 16:00:00 America/Bogota
4 2016-09-19 16:00:00 America/New_York
5 2016-09-19 11:00:00 Africa/Cairo
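One caveat with the mask loop: it overwrites the original column, and by default tz_localize raises on wall-clock times that are ambiguous or do not exist in a zone (DST transitions). The sketch below is a variation under my own assumptions: it writes to a new datetime_utc column and passes ambiguous="NaT" and nonexistent="NaT" so such rows become NaT rather than stopping the loop. The 02:30 New York row is an invented example added to show the DST gap; it is not from the question's data.

```python
import pandas as pd

t = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-09-19 01:29:13",
        "2016-03-13 02:30:00",  # invented row: falls in New York's spring-forward gap
        "2016-09-19 01:57:54",
    ]),
    "time_zone": ["America/Bogota", "America/New_York", "Africa/Cairo"],
})

# Collect naive UTC values in a separate Series so the input stays untouched
utc = pd.Series(pd.NaT, index=t.index, dtype="datetime64[ns]")

for tz in t["time_zone"].unique():
    mask = t["time_zone"] == tz
    localized = t.loc[mask, "datetime"].dt.tz_localize(
        tz, ambiguous="NaT", nonexistent="NaT"
    )
    # Convert to UTC, then drop the tz info to get naive UTC timestamps
    utc.loc[mask] = localized.dt.tz_convert("UTC").dt.tz_localize(None)

t["datetime_utc"] = utc
print(t)
```

Whether NaT is the right outcome for those rows depends on your data; the point is only that the choice can be made explicit per call.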
UPDATE:
In [12]: df['new'] = df.groupby('time_zone')['datetime'] \
    ...:     .transform(lambda x: x.dt.tz_localize(x.name).dt.tz_convert('UTC').dt.tz_localize(None))
In [13]: df
Out[13]:
datetime time_zone new
0 2016-09-19 01:29:13 America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04 America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54 Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00 America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00 America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00 Africa/Cairo 2016-09-19 11:00:00
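The question also mentions eventually going back from UTC to each viewer's local time. The same per-zone grouping can be run in reverse. This is a sketch only: the datetime_local column name is made up, and it assumes datetime_utc holds tz-aware UTC values. Each group is stripped back to naive wall-clock time because a single pandas column cannot carry a different time zone per row.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime_utc": pd.to_datetime([
        "2016-09-19 06:29:13",
        "2016-09-19 06:16:04",
        "2016-09-18 23:57:54",
    ], utc=True),
    "time_zone": ["America/Bogota", "America/New_York", "Africa/Cairo"],
})

# Convert each zone's rows to local time, then drop the tz info so all
# pieces share one naive datetime dtype and can be concatenated.
local_parts = [
    grp["datetime_utc"].dt.tz_convert(tz).dt.tz_localize(None)
    for tz, grp in df.groupby("time_zone")
]

# Column assignment aligns on the index, so row order is preserved
df["datetime_local"] = pd.concat(local_parts)
print(df)
```

On the sample rows this recovers the original wall-clock values from the question.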