Nature of Iterables in GroupByKey Transform

google-cloud-dataflow
apache-beam

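I am using Google Dataflow through the Java SDK. The GroupByKey transform returns an Iterable in the "value" part of each KV in the resulting PCollection. Suppose we run a ParDo over the KV results of the GroupByKey transform. Can anyone tell me the "nature" of the Iterable object: does the Iterable hold a fully pre-populated list, meaning that if the Iterable contains 1000 Integers, it consumes 1000*sizeof(Integer) of memory on that node? Or is the Iterable evaluated "lazily" (similar to a generator in Python), so that memory consumption stays very small no matter how large the Iterable is?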

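These iterables are lazy, at least when running on the Dataflow runner, which allows a single key to hold more data than fits in memory. As you iterate through the Iterable, the values for the key are loaded into memory lazily.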
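To illustrate what that means for the code consuming the values, here is a minimal plain-Java sketch. It is not Beam code and does not show how the Dataflow runner implements its iterables: `lazyRange` is a hypothetical stand-in for the Iterable that GroupByKey hands to a ParDo, and `sum` shows the single streaming pass that benefits from lazy loading.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LazyIterableSketch {

    // Hypothetical stand-in for the Iterable<Integer> produced by GroupByKey.
    // Each value is generated on demand; there is no backing list.
    public static Iterable<Integer> lazyRange(final int count) {
        return () -> new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    // Streams over the values once, keeping only the running total in memory.
    public static long sum(Iterable<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sums 0..999 without ever holding all 1000 values at once.
        System.out.println(sum(lazyRange(1000))); // prints 499500
    }
}
```

The same pattern applies inside a DoFn: iterating once and aggregating as you go keeps memory use small, whereas copying the values into a List would materialize every value for the key on that worker and defeat the lazy loading.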
