How do you set existing_data_behavior in pyarrow?

I'm getting the error below. How do I change this behavior when writing a dataset (write_dataset)?

pyarrow.lib.ArrowInvalid: Could not write to <my-output-dir> as the directory is not empty and existing_data_behavior is to error

Update: if you are using version 6.0.0, then this is a bug (see below). If you are using a version >= 6.0.1, then you can specify it as part of the write_dataset call:

import pyarrow as pa
import pyarrow.dataset as ds

tab = pa.Table.from_pydict({"x": [1, 2, 3], "y": ["x", "y", "z"]})
partitioning = ds.partitioning(schema=pa.schema([pa.field('y', pa.utf8())]), flavor='hive')
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning)
# This write would fail because data exists and the default
# is to not allow a potential overwrite
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning)
# By specifying existing_data_behavior we can change that
# default to return to the previous behavior
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning, existing_data_behavior='overwrite_or_ignore')

Legacy 6.0.0 answer

Unfortunately, this is a bug: https://issues.apache.org/jira/browse/ARROW-14620

The default behavior changed in 6.0.0 so that the write_dataset method will not proceed if data exists in the destination directory. The flag to override this behavior was not included in the Python bindings.

The workaround is to use an older version or to delete all the files in the directory first.