Data types are not preserved when a pandas DataFrame is partitioned and saved as a parquet file using pyarrow
When a pandas DataFrame is partitioned and saved as a parquet dataset using pyarrow, the data type of the partition column is not preserved on reload.
Case 1: Saving a partitioned dataset - data types are NOT preserved
# Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index is an argument of Table.from_pandas;
# older pyarrow releases also accepted it in write_to_dataset
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path, partition_cols=partition_cols)
# Load the partitioned parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
Output:
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
name object
age category
dtype: object
Case 2: Non-partitioned dataset - data types are preserved
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index is an argument of Table.from_pandas;
# older pyarrow releases also accepted it in write_to_dataset
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path)
# Load the non-partitioned parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
Output:
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
age int64
name object
dtype: object
There is no obvious way to do this. Please refer to the JIRA issue below.