Cloud Dataflow to BigQuery - too many sources

Question

I have a job that, among other things, inserts some data read from files into a BigQuery table for later manual analysis.

It fails with the following error:

job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.

What does "source" refer to? Is it a file or a pipeline step?

Thanks, G

Answer 1

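My guess is that the error comes from BigQuery and means that we are trying to upload too many files when creating your output table.

Could you provide more details about the error and its context (for example, a snippet of the command-line output, if you are using BlockingDataflowPipelineRunner) so I can confirm? A jobId would also help.

Is there something about the structure of your pipeline that would produce a large number of output files? That could be either a large amount of data, or finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).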
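To make the arithmetic behind that guess concrete, here is a small plain-Java sketch. It is only a toy model, not Dataflow or BigQuery code: it assumes each output file counts as one "source", takes the 10000 limit from the error message, and uses hypothetical names (SourceLimitSketch, exceedsLimit, reshard) to show how merging many small shards into larger bundles brings the count back under the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: this is not Dataflow or BigQuery code.
final class SourceLimitSketch {
    // Limit quoted in the error message from the question.
    static final int MAX_SOURCES = 10_000;

    // Assumption: every output file is counted as one load source.
    static boolean exceedsLimit(int outputFiles) {
        return outputFiles > MAX_SOURCES;
    }

    // Merge many small shards into bundles of at most bundleSize shards each.
    static <T> List<List<T>> reshard(List<T> shards, int bundleSize) {
        if (bundleSize <= 0) {
            throw new IllegalArgumentException("bundleSize must be positive");
        }
        List<List<T>> bundles = new ArrayList<>();
        for (int i = 0; i < shards.size(); i += bundleSize) {
            int end = Math.min(i + bundleSize, shards.size());
            bundles.add(new ArrayList<>(shards.subList(i, end)));
        }
        return bundles;
    }

    public static void main(String[] args) {
        List<String> shards = new ArrayList<>();
        for (int i = 0; i < 10_001; i++) {
            shards.add("shard-" + i);
        }
        // 10001 small files, one source each: over the limit.
        System.out.println(exceedsLimit(shards.size()));
        // 101 bundles after merging 100 shards per bundle: under the limit.
        System.out.println(exceedsLimit(reshard(shards, 100).size()));
    }
}
```

In a real pipeline the resharding is done by the service after a GroupByKey, not by hand; the sketch only shows why fewer, larger pieces avoid the limit.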

Answer 2

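// Imports assume the Dataflow SDK for Java 1.x package layout.
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Declare this as a nested class of your pipeline class.
// It forces a GroupByKey by pairing every element with a null value.
public static class ForceGroupBy<T>
        extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
        // Step 1: use each element as the key, with a null placeholder value.
        PCollection<KV<T, Void>> syntheticGroup = input.apply(
                ParDo.of(new DoFn<T, KV<T, Void>>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public void processElement(ProcessContext c) throws Exception {
                        c.output(KV.of(c.element(), (Void) null));
                    }
                }));
        // Step 2: group by key, which gives the service a point to reshard the data.
        return syntheticGroup.apply(GroupByKey.<T, Void>create());
    }
}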
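For readers unfamiliar with the shape of the output, the following plain-Java sketch models what the ForceGroupBy transform above produces. It is an in-memory analogy, not SDK code: ForceGroupBySketch, forceGroupBy, and ungroup are hypothetical names, and a real pipeline gives no ordering guarantees. The point it illustrates is that duplicate elements collapse into one key with several null values, so a later step has to emit the key once per value to get the original elements back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory analogy only: this does not use the Dataflow SDK.
final class ForceGroupBySketch {
    // Mirrors the two steps of the transform: pair with null, then group by key.
    static <T> Map<T, List<Void>> forceGroupBy(List<T> input) {
        Map<T, List<Void>> grouped = new LinkedHashMap<>();
        for (T element : input) {
            grouped.computeIfAbsent(element, k -> new ArrayList<>()).add(null);
        }
        return grouped;
    }

    // Emits each key once per grouped value, restoring the original multiset.
    static <T> List<T> ungroup(Map<T, List<Void>> grouped) {
        List<T> out = new ArrayList<>();
        for (Map.Entry<T, List<Void>> entry : grouped.entrySet()) {
            for (int i = 0; i < entry.getValue().size(); i++) {
                out.add(entry.getKey());
            }
        }
        return out;
    }
}
```

If the input is known to contain no duplicates, the ungroup step reduces to taking the key of each pair.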

Answer 3

The following note addresses this problem:

Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment: --experiments=enable_custom_bigquery_sink

In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.

Note that in both versions, temporary files in GCS may be left over if your job fails.
