GroupIntoBatches for non-KV elements

Question

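According to the Apache Beam 2.0.0 SDK documentation, GroupIntoBatches works only with KV collections.

My dataset contains only values, and there is no need to introduce keys. However, to make use of GroupIntoBatches I had to implement "fake" keys, using an empty string as the key: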

static class FakeKVFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(KV.of("", c.element()));
  }
}

So the overall pipeline looks like this:

public static void main(String[] args) {
  PipelineOptions options = PipelineOptionsFactory.create();
  Pipeline p = Pipeline.create(options);

  long batchSize = 100L;

  p.apply("ReadLines", TextIO.read().from("./input.txt"))
      .apply("FakeKV", ParDo.of(new FakeKVFn()))
      .apply(GroupIntoBatches.<String, String>ofSize(batchSize))
      .setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(StringUtf8Coder.of())))
      .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(callWebService(c.element().getValue()));
        }
      }))
      .apply("WriteResults", TextIO.write().to("./output/"));

  p.run().waitUntilFinish();
}

Is there any way to group into batches without introducing "fake" keys?

Answer 1

GroupIntoBatches requires KV input because the transform is implemented using state and timers, which are scoped per key and window.

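For each key+window pair, state and timers necessarily execute serially (or observably so). You have to express the available parallelism manually by providing keys (and windows, though no runner that I know of parallelizes over windows today). The two most common approaches are: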

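Use some natural key, such as a user ID.
Choose some fixed number of shards and assign keys randomly. This can be harder to tune: you need enough shards to get enough parallelism, but each shard needs to contain enough data for GroupIntoBatches to actually be useful.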
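The trade-off in the second approach can be sketched in plain Java, without any Beam dependency. This is only a simulation under stated assumptions, not how GroupIntoBatches is implemented: the class ShardedBatchingSketch and the methods randomShard and batchByShard are hypothetical names introduced for illustration, and the per-key buffering stands in for the per-key state the transform keeps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: simulates random shard keys plus per-key batching.
public class ShardedBatchingSketch {

  // Picks one of numShards keys at random. In a real pipeline, each distinct
  // key is a unit of work that can be processed independently of the others.
  public static int randomShard(Random rng, int numShards) {
    return rng.nextInt(numShards);
  }

  // Buffers elements per shard key and emits a batch whenever a buffer
  // reaches batchSize. Partially filled buffers are flushed at the end.
  public static List<List<String>> batchByShard(
      List<String> elements, int numShards, int batchSize, Random rng) {
    Map<Integer, List<String>> buffers = new HashMap<>();
    List<List<String>> batches = new ArrayList<>();
    for (String element : elements) {
      int shard = randomShard(rng, numShards);
      List<String> buffer = buffers.computeIfAbsent(shard, k -> new ArrayList<>());
      buffer.add(element);
      if (buffer.size() == batchSize) {
        batches.add(new ArrayList<>(buffer));
        buffer.clear();
      }
    }
    for (List<String> leftover : buffers.values()) {
      if (!leftover.isEmpty()) {
        batches.add(new ArrayList<>(leftover));
      }
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      lines.add("line-" + i);
    }
    // More shards means more keys to parallelize over, but also more
    // small, partially filled batches for the same amount of input.
    for (int numShards : new int[] {1, 10, 500}) {
      List<List<String>> batches = batchByShard(lines, numShards, 100, new Random(42));
      System.out.println(numShards + " shards -> " + batches.size() + " batches");
    }
  }
}
```

In the pipeline from the question, the equivalent change would presumably be to have the keying DoFn emit KV.of(shard, c.element()) with an Integer shard chosen this way instead of KV.of("", c.element()), and to adjust the type parameters and coders to match (for example GroupIntoBatches.<Integer, String>ofSize(batchSize) with VarIntCoder for the key). The number of shards is a tuning parameter that depends on the data volume and the runner.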

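Adding one dummy key to all elements, as in your snippet, will cause the transform to not execute in parallel at all. This is similar to the discussion in "Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner".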


apache-beam