Python timezone determination and manipulation with tzwhere produces objects rather than datetime64s in some situations
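I have millions of rows, each containing a timezone-aware UTC datetime64 and a Latitude/Longitude pair. For each row, I need to determine the local timezone and create a column holding the local time. To do this, I use the tzwhere package.
A simple dataset that illustrates the problem: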
TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0
The function I use to get the timezone and then create the local time value:
def tz_from_location(row, tz):
    # Hardcoded in an effort to circumvent the problem. The returned value is still
    # an object, even though row.TimeUTC is a datetime64
    if (row.Latitude == 0) & (row.Longitude == 0):
        print ("0,0")
        ret_val = row.TimeUTC.tz_convert('UTC')
        return (row.TimeUTC)
    try:
        # forceTZ=True tells it to find the nearest timezone for places without one
        tzname = tz.tzNameAt(row.Latitude, row.Longitude, forceTZ=True)
        if (tzname == 'uninhabited'):
            return(row.TimeUTC)
        ret_val = row.TimeUTC.tz_convert(tzname)
        # ret_val = ret_val.to_pydatetime()
    except Exception as e:
        print(f'tz_from_location - Latitude: {row.Latitude} Longitude: {row.Longitude}')
        print(f'Error {e}')
        exit(-1)
    return(ret_val)
The function is called as follows:
import pandas as pd
from tzwhere import tzwhere
from datetime import datetime
bug = pd.read_csv('./foo.csv')
# Initialize tzwhere
tz = tzwhere.tzwhere(forceTZ=True)
# Create the UTC column
bug['TimeUTC'] = bug['TimeUTC'].astype('datetime64[ns]')
# The original data comes in with a timezone that is of the local computer, not
# the location. Turn that into UTC
bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
# Now call the function
bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
# We are putting this into PostgreSQL. If you try to put a TZ aware datetime
# in it will automatically convert it to UTC. So, we need to make this value
# naive and then upload it
bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')
The last line raises an error for the row containing 0,0, but not for any other row.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/8d/jp8b0rbx5rq0l8p8cnbb5k_r0000gn/T/ipykernel_49416/4161114700.py in <module>
4 bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
5 bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
----> 6 bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
492 return PeriodProperties(data, orig)
493
--> 494 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
This is because the first row contains a datetime64 while the second row is an object.
These are the TimeUTC values before the call:
bug.TimeUTC
0 2021-10-11 12:16:00-04:00
1 2021-10-11 12:16:00-04:00
Name: TimeUTC, dtype: datetime64[ns, US/Eastern]
This is the data frame with TimeLocal added:
bug.TimeLocal
0 2021-10-11 12:16:00-04:00
1 2021-10-11 12:16:00-04:00
Name: TimeLocal, dtype: object
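If you look at the individual rows, the first one is correct, but the second one is an object.
All my attempts to return something that does not show up as an object for the 0,0 row have failed. I am sure I am missing something simple.
Here are some suggestions; given the example DataFrame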
TimeUTC Latitude Longitude
0 2021-10-11 12:16:00+00:00 42.289723 -71.031715
1 2021-10-11 12:16:00+00:00 0.000000 0.000000
make sure to parse the datetime column to a datetime data type:
df['TimeUTC'] = pd.to_datetime(df['TimeUTC'])
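As a small illustration of this parsing step, the sketch below reads the two sample rows from an in-memory string (a stand-in for the CSV file, which is an assumption made here) and prints the dtype before and after parsing. Passing utc=True is an optional extra that is not part of the line above; it makes the result timezone-aware UTC even when the offsets in the input differ.

```python
import io

import pandas as pd

# The two sample rows from the question, held in memory instead of a file.
csv_text = """TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))
raw_dtype = df["TimeUTC"].dtype  # still text at this point, as read from the CSV

# Parse the text into timezone-aware datetimes; utc=True normalizes to UTC.
df["TimeUTC"] = pd.to_datetime(df["TimeUTC"], utc=True)

print(raw_dtype)
print(df["TimeUTC"].dtype)
```

Once the column has a timezone-aware datetime dtype, the .dt accessor and Timestamp.tz_convert work on it as expected.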
Then you can refactor the function that derives the tz from lat/long, e.g.
from timezonefinder import TimezoneFinder
def tz_from_location(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return row.TimeUTC.tzname()
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return tzname
    return row.TimeUTC.tzname() # else return existing name
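I suggest using timezonefinder, since I have found it to be more efficient and reliable.
Now you can easily apply the function and create a column converted to the local tz: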
df['TimeLocal'] = df.apply(lambda row: row['TimeUTC'].tz_convert(tz_from_location(row)), axis=1)
which gives you
TimeUTC Latitude Longitude TimeLocal
0 2021-10-11 12:16:00+00:00 42.289723 -71.031715 2021-10-11 08:16:00-04:00
1 2021-10-11 12:16:00+00:00 0.000000 0.000000 2021-10-11 12:16:00+00:00
df['TimeLocal'].iloc[0]
Out[2]: Timestamp('2021-10-11 08:16:00-0400', tz='America/New_York')
df['TimeLocal'].iloc[1]
Out[3]: Timestamp('2021-10-11 12:16:00+0000', tz='UTC')
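(!) However, because the TimeLocal column contains mixed timezones, the data type of the whole Series will be object. There is no way around that; it is how pandas handles mixed timezones in a single Series.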
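Since the stated goal is a timezone-naive local time for the database, one possible way around the object dtype is to drop the timezone from each value inside the row function, so that every element is a naive Timestamp. The sketch below assumes a tzname column is already present (hard-coded here to keep the example independent of any lookup library); the helper name local_naive is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "TimeUTC": pd.to_datetime(["2021-10-11 12:16:00+00:00"] * 2, utc=True),
    # Hard-coded zone names standing in for the result of a lat/long lookup.
    "tzname": ["America/New_York", "UTC"],
})

def local_naive(row):
    # Convert to the row's zone, then strip the tz info, keeping the wall time.
    return row["TimeUTC"].tz_convert(row["tzname"]).tz_localize(None)

# Mixed zones: pandas falls back to object dtype, so the .dt accessor fails.
mixed = df.apply(lambda row: row["TimeUTC"].tz_convert(row["tzname"]), axis=1)

# All-naive values: the column is datetime-like again.
df["TimeLocal"] = df.apply(local_naive, axis=1)

print(mixed.dtype)
print(df["TimeLocal"].dtype)
```

Note that the naive column no longer records which zone each value belongs to, so keeping the zone name in a separate column is probably worthwhile.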
Addendum
If we also want a column with the timezone name, we can make the function return a tuple and use result_type='expand' in the call to apply:
def convert_to_location_tz(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return (row.TimeUTC.tzname(), row.TimeUTC)
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return (tzname, row.TimeUTC.tz_convert(tzname))
    return (row.TimeUTC.tzname(), row.TimeUTC) # else return existing name
df[['tzname', 'TimeLocal']] = df.apply(lambda row: convert_to_location_tz(row), axis=1, result_type='expand')
df[['tzname', 'TimeLocal']]
Out[9]:
tzname TimeLocal
0 America/New_York 2021-10-11 08:16:00-04:00
1 UTC 2021-10-11 12:16:00+00:00
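With millions of rows, converting one Timestamp per row can be slow. A possible alternative, sketched below, is to build the tzname column first and then convert all rows that share a zone in one vectorized call. The function name to_naive_local and the extra Europe/Berlin row are illustrative assumptions; the result is timezone-naive local time, which is what the question wants to store, and the sketch assumes a non-empty frame with a unique index.

```python
import pandas as pd

def to_naive_local(times_utc, tznames):
    """Return naive local wall times for tz-aware UTC times, one zone group at a time."""
    pieces = [
        # Vectorized conversion for every row in this zone, then drop the tz info.
        group.dt.tz_convert(name).dt.tz_localize(None)
        for name, group in times_utc.groupby(tznames)
    ]
    # Restore the original row order; rows without a zone name become NaT.
    return pd.concat(pieces).reindex(times_utc.index)

df = pd.DataFrame({
    "TimeUTC": pd.to_datetime(["2021-10-11 12:16:00+00:00"] * 3, utc=True),
    # Zone names as they might come out of the lat/long lookup.
    "tzname": ["America/New_York", "UTC", "Europe/Berlin"],
})
df["TimeLocal"] = to_naive_local(df["TimeUTC"], df["tzname"])
print(df["TimeLocal"])
```

The number of tz_convert calls then scales with the number of distinct zones rather than the number of rows, and the resulting column keeps a datetime dtype so the .dt accessor remains usable.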