Python timezone determination and manipulation with tzwhere produces objects rather than datetime64s in some situations

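I have millions of rows, each containing a time-zone-aware UTC datetime64 and a Latitude/Longitude pair. For each row, I need to determine the local time zone and create a column holding the local time. To do this, I use the tzwhere package.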

A simple dataset that illustrates the problem:

TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0

The function I use to get the time zone and then create the local time value:

def tz_from_location(row, tz):

    # Hardcoded in an effort to circumvent the problem. The returned value is still
    # an object, even though row.TimeUTC is a datetime64
    if (row.Latitude == 0) & (row.Longitude == 0):
        print ("0,0")
        ret_val = row.TimeUTC.tz_convert('UTC')
        return (row.TimeUTC)

    try:
        # forceTZ=True tells it to find the nearest timezone for places without one
        tzname = tz.tzNameAt(row.Latitude, row.Longitude, forceTZ=True)
        if (tzname == 'uninhabited'):
            return(row.TimeUTC)
            
        ret_val = row.TimeUTC.tz_convert(tzname)
#        ret_val = ret_val.to_pydatetime()
    except Exception as e:
        print(f'tz_from_location - Latitude: {row.Latitude} Longitude: {row.Longitude}')
        print(f'Error {e}')
        exit(-1)
        
    return(ret_val)

The function is called as follows:

import pandas as pd
from tzwhere import tzwhere
from datetime import datetime

bug = pd.read_csv('./foo.csv')

# Initialize tzwhere
tz = tzwhere.tzwhere(forceTZ=True)

# Create the UTC column
bug['TimeUTC'] = bug['TimeUTC'].astype('datetime64[ns]')

# The original data comes in with a timezone that is of the local computer, not
# the location. Turn that into UTC
bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')

# Now call the function
bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)

# We are putting this into PostgreSQL. If you try to put a TZ aware datetime
# in it will automatically convert it to UTC. So, we need to make this value
# naive and then upload it
bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')

The last line raises an error on the row containing 0,0, but not on any other row.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8d/jp8b0rbx5rq0l8p8cnbb5k_r0000gn/T/ipykernel_49416/4161114700.py in <module>
      4 bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
      5 bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
----> 6 bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    492             return PeriodProperties(data, orig)
    493 
--> 494         raise AttributeError("Can only use .dt accessor with datetimelike values")

AttributeError: Can only use .dt accessor with datetimelike values

This is because the first row contains a datetime64 and the second row is an object.

Here are the TimeUTC values before the call:

bug.TimeUTC
0   2021-10-11 12:16:00-04:00
1   2021-10-11 12:16:00-04:00
Name: TimeUTC, dtype: datetime64[ns, US/Eastern]

Here is the TimeLocal column after it has been added to the data frame:

bug.TimeLocal
0    2021-10-11 12:16:00-04:00
1    2021-10-11 12:16:00-04:00
Name: TimeLocal, dtype: object

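If you look at the individual rows, the first one is correct, but the second one is an object.

All my efforts to return something that does not show up as an object for the 0,0 row have failed. I'm sure I'm missing something simple.

Here are a few suggestions. Given the example DataFrame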

                     TimeUTC   Latitude  Longitude
0  2021-10-11 12:16:00+00:00  42.289723 -71.031715
1  2021-10-11 12:16:00+00:00   0.000000   0.000000

make sure to parse the datetime column to a datetime dtype:

df['TimeUTC'] = pd.to_datetime(df['TimeUTC'])
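To make the example reproducible, here is a minimal sketch that builds the sample DataFrame from the CSV text shown in the question and checks the resulting dtype. Reading from an in-memory string and passing `utc=True` are my assumptions, not part of the original code; `utc=True` just guarantees a single time-zone-aware UTC column even if the input offsets were to differ.

```python
import io

import pandas as pd

# Sample data from the question, held in a string instead of a file (assumption)
csv_text = """TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# utc=True (assumption) forces one tz-aware UTC column
df["TimeUTC"] = pd.to_datetime(df["TimeUTC"], utc=True)

# TimeUTC should now be a tz-aware datetime column rather than object
print(df.dtypes)
```

If `TimeUTC` still shows up as `object` at this point, the parsing step is the first place to look.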

Then, you can refactor the function that derives the tz from lat/long, e.g.

from timezonefinder import TimezoneFinder

def tz_from_location(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if row.Latitude == 0 and row.Longitude == 0:
        return row.TimeUTC.tzname()
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return tzname
    return row.TimeUTC.tzname() # else return existing name

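I suggest using timezonefinder, since I have found it to be more efficient and reliable (see its docs and GitHub repository).

Now you can easily apply the function and create a column converted to the local tz: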

df['TimeLocal'] = df.apply(lambda row: row['TimeUTC'].tz_convert(tz_from_location(row)), axis=1)

which gives you

                    TimeUTC   Latitude  Longitude                  TimeLocal
0 2021-10-11 12:16:00+00:00  42.289723 -71.031715  2021-10-11 08:16:00-04:00
1 2021-10-11 12:16:00+00:00   0.000000   0.000000  2021-10-11 12:16:00+00:00
df['TimeLocal'].iloc[0]
Out[2]: Timestamp('2021-10-11 08:16:00-0400', tz='America/New_York')

df['TimeLocal'].iloc[1]
Out[3]: Timestamp('2021-10-11 12:16:00+0000', tz='UTC')

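(!) But... since you have mixed time zones in the TimeLocal column, the dtype of the whole Series will be object. There is no way around that; it is simply how pandas handles mixed time zones in a single Series.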
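The following sketch illustrates that behavior in isolation, and shows one possible way out if naive local times are acceptable (as the question suggests for the PostgreSQL upload). The variable names are hypothetical, and stripping the time zone per element is my assumption about what fits the use case, not the only option.

```python
import pandas as pd

utc_time = pd.Timestamp("2021-10-11 12:16:00", tz="UTC")

# One time zone for every element: pandas keeps a tz-aware datetime dtype
same_zone = pd.Series([utc_time, utc_time])

# Two different time zones in one Series: pandas falls back to object
mixed_zone = pd.Series([utc_time.tz_convert("America/New_York"), utc_time])

print(same_zone.dtype)
print(mixed_zone.dtype)  # object

# The .dt accessor is unavailable on the object Series
try:
    mixed_zone.dt.hour
except AttributeError as err:
    print(f"AttributeError: {err}")

# Dropping the tz from each Timestamp keeps the local wall time and
# lets pandas store the values as plain (naive) datetimes again
naive_local = pd.Series(
    pd.to_datetime([ts.tz_localize(None) for ts in mixed_zone]),
    index=mixed_zone.index,
)
print(naive_local.dtype)
print(naive_local.dt.hour.tolist())
```

Note that the naive values no longer record which zone they belong to, so it is worth storing the time zone name in a separate column.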


Addendum

If we also want a column with the time zone name, we can make the function return a tuple and pass result_type='expand' in the call to apply:

def convert_to_location_tz(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if row.Latitude == 0 and row.Longitude == 0:
        return (row.TimeUTC.tzname(), row.TimeUTC)
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return (tzname, row.TimeUTC.tz_convert(tzname))
    return (row.TimeUTC.tzname(), row.TimeUTC) # else return existing name

df[['tzname', 'TimeLocal']] = df.apply(lambda row: convert_to_location_tz(row), axis=1, result_type='expand')

df[['tzname', 'TimeLocal']]
Out[9]: 
             tzname                  TimeLocal
0  America/New_York  2021-10-11 08:16:00-04:00
1               UTC  2021-10-11 12:16:00+00:00
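With a `tzname` column available, the conversion itself does not have to run row by row. The sketch below is one possible approach for large frames, assuming naive local times are the goal: it converts each time zone group in a single vectorized call. The column name `TimeLocalNaive` is hypothetical, and the `tzname` values are typed in by hand here rather than looked up with timezonefinder.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "TimeUTC": pd.to_datetime(
            ["2021-10-11 12:16:00+00:00", "2021-10-11 12:16:00+00:00"], utc=True
        ),
        # Hand-written stand-ins for the names a tz lookup would return
        "tzname": ["America/New_York", "UTC"],
    }
)

# Convert each time zone group at once, then drop the tz so that all
# pieces share one naive datetime dtype
pieces = [
    group.dt.tz_convert(name).dt.tz_localize(None)
    for name, group in df.groupby("tzname")["TimeUTC"]
]

# The pieces keep their original row labels, so assignment realigns them
df["TimeLocalNaive"] = pd.concat(pieces).sort_index()

print(df[["tzname", "TimeLocalNaive"]])
```

The number of `tz_convert` calls then scales with the number of distinct time zones rather than the number of rows.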