Ignore columns not present in a Parquet file with pyarrow in pandas
I am trying to read a Parquet file using pyarrow==1.0.1 as the engine.
Given:
columns = ['a','b','c']
pd.read_parquet(x, columns=columns, engine="pyarrow")
If the file x does not contain the column c, this raises:
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Field named 'c' not found or not unique in the schema.
There is no argument to ignore the missing column and simply read it as NaN.
The error handling is also poor. The exception is just:
pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")
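It is hard to extract the missing column name from this exception so that the column can be dropped from the list passed in on the next try.
Is there a way to do this?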
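You can read the metadata from the Parquet file to find out which columns are available.
Bear in mind, though, that pandas cannot guess the type of the missing column (c in the example below), which may cause problems later when you concatenate tables.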
import pandas as pd
import pyarrow.parquet as pq

all_columns = ['a', 'b', 'c']

# Write a sample file that only has columns a and b.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

# Keep only the requested columns that exist in the file's schema.
parquet_file = pq.ParquetFile(file_name)
columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]

# Read the available columns, then add the missing ones back as NaN.
df = (
    parquet_file
    .read(columns=columns_in_file)
    .to_pandas()
    .reindex(columns=all_columns)
)