Read every nth batch in pyarrow.dataset.Dataset

Question

In PyArrow you can currently do:

import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
b = a.to_batches()  # iterator of RecordBatch objects
first_batch = next(b)

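What if I want the iterator to return every Nth batch instead of every single one? It seems like this could be something in FragmentScanOptions, but that is not documented at all.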

Answer 1

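No, there is no way to do this today. I'm not sure what you are after, but if you want to sample your data there are a few options, though none of them achieves exactly this effect.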

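To load only a small portion of the data from disk, you can use pyarrow.dataset.Dataset.head.
There is a request in place for randomly sampling a dataset, although the proposed implementation would still load all of the data into memory (and simply drop rows according to some random probability).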
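If the goal is only to thin out the stream of batches, and not to avoid the I/O, the iterator returned by to_batches() can be filtered on the Python side. The sketch below is a minimal illustration; the helper name every_nth and its start parameter are made up for this example and are not part of PyArrow. Every batch is still read from disk, and the skipped ones are simply discarded.

```python
from itertools import islice


def every_nth(batches, n, start=0):
    """Return an iterator over every n-th item of `batches`, beginning at index `start`.

    Hypothetical helper, not a PyArrow API. It works on any iterable.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # islice consumes the skipped items, so they are still produced upstream
    return islice(batches, start, None, n)


# Assumed usage with a dataset (every batch is still scanned):
#   import pyarrow.dataset as ds
#   for batch in every_nth(ds.dataset("blah.parquet").to_batches(), 5):
#       ...
```

This keeps memory use at roughly one batch at a time, but it does not reduce the amount of data read.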

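Update: If your dataset consists only of Parquet files, there are some rather custom parts and pieces that you can cobble together to achieve what you want.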

import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
# Split every file fragment into one fragment per Parquet row group
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
# Keep every other row group (use [::n] for every nth)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
