Preserve index when loading pyarrow parquet from pandas DataFrame
I need to convert a dictionary whose values are dictionaries to Parquet. My data looks like this:
{"KEY":{"2018-12-06":250.0,"2018-12-07":234.0}}
I convert it to a pandas DataFrame and then write it out as a pyarrow table:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
data = {"KEY":{"2018-12-06":250.0,"2018-12-07":234.0}}
df = pd.DataFrame.from_dict(data, orient='index')
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, 'file.parquet', flavor='spark')
The data I get back contains only the dates and values, without the dictionary's key:
{"2018-12-06":250.0,"2018-12-07":234.0}
What I need is for the data to include the key as well:
{"KEY": {"2018-12-06":250.0,"2018-12-07":234.0}}
If you want to keep the index, you have to say so explicitly by setting preserve_index=True:
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'file.parquet', flavor='spark')
pq.read_table('file.parquet').to_pandas() # Index is preserved.
     2018-12-06  2018-12-07
KEY       250.0       234.0
I have observed a related but separate issue: the frequency (freq) of a DatetimeIndex is not preserved on the round trip from pandas to a table and back.
For example:
>>> import pandas as pd
>>> import pyarrow as pa
>>> from collections import OrderedDict
>>>
>>>
>>> pd.__version__
'1.1.5'
>>>
>>> pa.__version__
'4.0.1'
>>>
>>> dates = pd.date_range(start='2016-04-01', periods=4, name='DATE')
>>> dict_data = OrderedDict()
>>> dict_data['A'] = list('AABB')
>>> dict_data['B'] = list('abab')
>>> dict_data['C'] = list('wxyz')
>>> dict_data['D'] = range(0, 4)
>>> df = pd.DataFrame.from_dict(dict_data)
>>> df = df.set_index(dates)
>>>
>>> df.index
DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-04'], dtype='datetime64[ns]', name='DATE', freq='D')
>>>
>>> table = pa.Table.from_pandas(df, preserve_index=True)
>>> df2 = table.to_pandas()
>>> df2.index
DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-04'], dtype='datetime64[ns]', name='DATE', freq=None)