pandas to_datetime conversion with timezone-aware index
I have a DataFrame with a timezone-aware index:
>>> dfn.index
Out[1]:
DatetimeIndex(['2004-01-02 01:00:00+11:00', '2004-01-02 02:00:00+11:00',
'2004-01-02 03:00:00+11:00', '2004-01-02 04:00:00+11:00',
'2004-01-02 21:00:00+11:00', '2004-01-02 22:00:00+11:00'],
dtype='datetime64[ns]', freq='H', tz='Australia/Sydney')
I save it to a CSV file and then read it back as follows:
>>> dfn.to_csv('temp.csv')
>>> df= pd.read_csv('temp.csv', index_col=0 ,header=None )
>>> df.head()
Out[1]:
1
0
NaN 0.0000
2004-01-02 01:00:00+11:00 0.7519
2004-01-02 02:00:00+11:00 0.7520
2004-01-02 03:00:00+11:00 0.7515
2004-01-02 04:00:00+11:00 0.7502
The index is read back as strings:
>>> df.index[1]
Out[3]: '2004-01-02 01:00:00+11:00'
Converting it with to_datetime changes the time, because the +11:00 offset is folded into the hours:
>>> df.index = pd.to_datetime(df.index)
>>> df.index[1]
Out[6]: Timestamp('2004-01-01 14:00:00')
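I could now shift the index by 11 hours to fix it, but is there a better way to handle this?
I tried the solution from the answer, but it slows the code down considerably.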
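As a minimal sketch of what is happening to the hours: the value 2004-01-01 14:00:00 is the same instant as 2004-01-02 01:00:00+11:00, expressed in UTC, so nothing is lost and the local wall time can be recovered with tz_convert. The variable names below are illustrative and not from the original post.

```python
import pandas as pd

# One of the index labels as it appears in the CSV file.
ts = pd.Timestamp('2004-01-02 01:00:00+11:00')

# The same instant expressed in UTC: the wall clock moves back 11 hours.
as_utc = ts.tz_convert('UTC')
print(as_utc)   # 2004-01-01 14:00:00+00:00

# Converting to the named zone restores the original local time.
back = as_utc.tz_convert('Australia/Sydney')
print(back)     # 2004-01-02 01:00:00+11:00
```

Shifting the index by a fixed 11 hours would only be correct while Sydney is on daylight saving time, which is why converting by zone name is the safer choice.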
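I think the problem is that the file header has to be written and read in the same way.
To parse the dates you also need the parse_dates parameter.
In your code the header is written but not read:
# header is written to the file
dfn.to_csv('temp.csv')
# header is not read
df= pd.read_csv('temp.csv', index_col=0 ,header=None)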
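The effect of that mismatch can be reproduced with a small sketch. It assumes a three-row frame with illustrative values and an in-memory buffer instead of temp.csv; the frequency is passed as a Timedelta only to avoid version-specific frequency aliases.

```python
import io
import pandas as pd

# Small frame with a Sydney-aware hourly index (values are illustrative).
idx = pd.date_range('2004-01-02 01:00:00', periods=3,
                    freq=pd.Timedelta(hours=1), tz='Australia/Sydney')
dfn = pd.DataFrame({'col': [0.7519, 0.7520, 0.7515]}, index=idx)

# to_csv() without a path returns the CSV text, header line included.
csv_text = dfn.to_csv()

# Header written but not read: the header line becomes a data row
# whose index label is NaN, as in the question's output.
mismatched = pd.read_csv(io.StringIO(csv_text), index_col=0, header=None)

# Header written and read: only the real rows remain.
matched = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(len(mismatched), len(matched))   # 4 3
print(mismatched.index[0])             # nan
```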
Solution 1:
# header is not written
dfn.to_csv('temp.csv', header=None)
# header is not read
df= pd.read_csv('temp.csv', index_col=0 ,header=None, parse_dates=[0])
Solution 2:
# header is written
dfn.to_csv('temp.csv')
# header is read
df= pd.read_csv('temp.csv', index_col=0, parse_dates=[0])
Unfortunately, parse_dates converts the dates to UTC, so the time zone has to be added back afterwards:
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
print (df.index)
DatetimeIndex(['2004-01-02 01:00:00+11:00', '2004-01-02 02:00:00+11:00',
'2004-01-02 03:00:00+11:00', '2004-01-02 04:00:00+11:00',
'2004-01-02 05:00:00+11:00', '2004-01-02 06:00:00+11:00',
'2004-01-02 07:00:00+11:00', '2004-01-02 08:00:00+11:00',
'2004-01-02 09:00:00+11:00', '2004-01-02 10:00:00+11:00'],
dtype='datetime64[ns, Australia/Sydney]', name=0, freq=None)
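Whether parse_dates returns naive UTC values or offset-aware values appears to depend on the pandas version, and tz_localize('UTC') raises an error on an index that is already timezone-aware. A hedged alternative, not part of the original answer, is to read the index as strings and parse it explicitly with utc=True before converting. The sketch below assumes the same kind of sample frame and uses an in-memory buffer in place of temp.csv.

```python
import io
import pandas as pd

# Sample frame with a Sydney-aware hourly index.
idx = pd.date_range('2004-01-02 01:00:00', periods=10,
                    freq=pd.Timedelta(hours=1), tz='Australia/Sydney')
dfn = pd.DataFrame({'col': range(len(idx))}, index=idx)

# Write with the header to an in-memory buffer (stand-in for temp.csv).
buf = io.StringIO()
dfn.to_csv(buf)
buf.seek(0)

# Read with the header; the index is left as plain strings here.
df = pd.read_csv(buf, index_col=0)

# Parse to timezone-aware UTC, then convert to the original zone.
df.index = pd.to_datetime(df.index, utc=True).tz_convert('Australia/Sydney')

print(df.index[0])   # 2004-01-02 01:00:00+11:00
print(df.index.tz)   # Australia/Sydney
```

Because the conversion goes through UTC and a named zone, it does not rely on a fixed 11-hour offset and should also hold across daylight saving changes.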
Test sample:
idx = pd.date_range('2004-01-02 01:00:00', periods=10, freq='H', tz='Australia/Sydney')
dfn = pd.DataFrame({'col':range(len(idx))}, index=idx)
print (dfn)
col
2004-01-02 01:00:00+11:00 0
2004-01-02 02:00:00+11:00 1
2004-01-02 03:00:00+11:00 2
2004-01-02 04:00:00+11:00 3
2004-01-02 05:00:00+11:00 4
2004-01-02 06:00:00+11:00 5
2004-01-02 07:00:00+11:00 6
2004-01-02 08:00:00+11:00 7
2004-01-02 09:00:00+11:00 8
2004-01-02 10:00:00+11:00 9