Ignore columns not present in a Parquet file with pyarrow in pandas

I am trying to read a Parquet file using pyarrow==1.0.1 as the engine.

Given:

columns = ['a','b','c']
pd.read_parquet(x, columns=columns, engine="pyarrow")

If the file x does not contain the column c, this raises:

/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()

/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()

/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()

/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'c' not found or not unique in the schema.

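There is no argument that ignores the error and simply reads the missing columns as NaN.

The error handling is also poor. The exception is just:

pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")

It is hard to extract the name of the missing field from it, so I cannot easily use it to drop that column from the list on the next try.

Is there a way to do this?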

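You can read the metadata of the Parquet file to determine which columns are available.

Keep in mind that pandas cannot guess the type of a missing column (c in the example below), which may cause problems later when you concatenate tables.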

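import pandas as pd
import pyarrow.parquet as pq

# Columns we would like to end up with; 'c' is not in the file.
all_columns = ['a', 'b', 'c']

# Write a small example file that only contains 'a' and 'b'.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

# Opening the file reads its metadata, including the schema.
parquet_file = pq.ParquetFile(file_name)
# Keep only the requested columns that actually exist in the file.
columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
df = (
    parquet_file
        .read(columns=columns_in_file)
        .to_pandas()
        # Add the columns that were not in the file, filled with NaN.
        .reindex(columns=all_columns)
)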
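
If the type of the missing column matters for a later concatenation, one option is to cast the added columns explicitly right after the reindex. The following is only a sketch: the helper add_missing_columns and the missing_dtypes mapping are hypothetical names that are not part of pandas or pyarrow, and the frame is built by hand as a stand-in for the one read from the file, so the snippet runs without a Parquet file.

```python
import pandas as pd

all_columns = ['a', 'b', 'c']

# Assumption for illustration: the dtype each column should get when it is
# absent from the file. Nullable dtypes such as "string" or "Int64" can hold
# missing values without falling back to float or object.
missing_dtypes = {'c': 'string'}


def add_missing_columns(df, columns, dtypes):
    """Return df with all `columns`; absent ones are all-missing but typed."""
    absent = [name for name in columns if name not in df.columns]
    # reindex() adds the absent columns, filled with NaN.
    out = df.reindex(columns=columns)
    # Cast only the newly added columns for which a dtype was given.
    casts = {name: dtypes[name] for name in absent if name in dtypes}
    if casts:
        out = out.astype(casts)
    return out


# Stand-in for the frame read from the file, which has no column 'c'.
df_from_file = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
result = add_missing_columns(df_from_file, all_columns, missing_dtypes)
print(result.dtypes)
```

Columns that do exist in the file keep whatever type they were read with; only the added ones are cast, so every frame built this way ends up with the same type for c.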