How to set file size instead of number of shards when writing from BigQuery to Cloud Storage in Dataflow

Question

I am currently using Dataflow to read table data from BigQuery and write it to Cloud Storage with a fixed number of shards.

// Read main input
PCollection<TableRow> input = pipeline.apply("ReadTableInput",
    BigQueryIO.readTableRows().from("dataset.table"));

// Process rows and write files
input.apply("ProcessRows", ParDo.of(new Process()))
    .apply("WriteToFile", TextIO.write()
        .to(outputFile)
        .withHeader(HEADER)
        .withSuffix(".csv")
        .withNumShards(numShards));

To manage file size, we estimate the total number of shards needed to keep each file below a certain size.
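For illustration, the kind of estimate described above could be sketched as a ceiling division of the expected output size by the target file size. This is only a sketch under assumptions: `ShardEstimator`, `estimateNumShards`, and the 10 GiB / 256 MiB figures are hypothetical and not part of the original pipeline, and the result bounds the average file size only if rows are spread roughly evenly across shards.

```java
/** Hypothetical helper (not a Beam API): estimates a shard count from a target file size. */
final class ShardEstimator {
    private ShardEstimator() {}

    /**
     * Returns the smallest shard count that keeps the average file at or
     * below targetFileBytes, assuming output is spread evenly across shards.
     */
    static int estimateNumShards(long estimatedOutputBytes, long targetFileBytes) {
        if (estimatedOutputBytes < 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException(
                "output size must be non-negative and target size must be positive");
        }
        // Ceiling division without floating point.
        long shards = (estimatedOutputBytes + targetFileBytes - 1) / targetFileBytes;
        // Always write at least one shard, and stay within int range.
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, shards));
    }

    public static void main(String[] args) {
        long estimatedOutputBytes = 10L * 1024 * 1024 * 1024; // assumed: about 10 GiB of CSV output
        long targetFileBytes = 256L * 1024 * 1024;            // assumed: 256 MiB per file
        System.out.println(estimateNumShards(estimatedOutputBytes, targetFileBytes)); // prints 40
    }
}
```

The returned value would then be passed as `numShards` to `withNumShards(numShards)`. The size of the encoded CSV output has to be estimated separately, since it is not necessarily the same as the table size reported by BigQuery.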

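Is there a way to set a target file size and let the number of shards be determined dynamically, instead of setting the shard count?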

Answer 1

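By design, this is not possible. If you dig into the core of Beam, you programmatically define an execution graph and then run it. The processing is massively parallel (ParDo stands for 'Parallel Do'), either on a single node or across multiple nodes/VMs.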

Here, the number of shards is the number of "writers" that write files in parallel. The PCollection is then split among all the workers doing the writing.

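The output size varies widely (it depends on the size of the messages, the text encoding, whether the output is compressed, the compression factor, and so on), so Beam cannot rely on it to build its graph.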
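As a rough illustration of that variability, the sketch below measures the same CSV rows under two encodings and with gzip compression, using only the JDK. The class name `EncodedSizeDemo` and the sample row are made up for this example; nothing here is Beam or Dataflow code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo: the same rows occupy different byte counts depending on encoding and compression. */
final class EncodedSizeDemo {
    private EncodedSizeDemo() {}

    /** Number of bytes the text occupies in the given charset. */
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Number of bytes the text occupies after encoding and gzip compression. */
    static int gzippedSize(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.size();
    }

    public static void main(String[] args) {
        // Assumed sample data: 1000 identical CSV rows.
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            rows.append("42,example,2020-01-01\n");
        }
        String csv = rows.toString();

        System.out.println("UTF-8 bytes:    " + encodedSize(csv, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE bytes: " + encodedSize(csv, StandardCharsets.UTF_16LE));
        System.out.println("gzip bytes:     " + gzippedSize(csv, StandardCharsets.UTF_8));
    }
}
```

The same number of rows yields quite different byte counts in each case, which is why a row count known when the graph is built does not translate into a predictable file size.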

java

google-cloud-storage

google-bigquery

google-cloud-dataflow