Why in Scio would you prefer aggregate over groupByKey?

From:

https://github.com/spotify/scio/wiki/Scio-data-guideline

"Prefer combine/aggregate/reduce transforms over groupByKey. Keep in mind that a reduce operation must be associative and commutative."

Why would one prefer aggregate over groupByKey?

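Combine, aggregate, and reduce transforms are preferred over groupByKey because they are more memory-efficient during pipeline execution. This follows from how the primitive GroupByKey and Combine transforms are implemented in Apache Beam, so the answer is not necessarily specific to Scio.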

Scio's groupByKey uses Beam's primitive GroupByKey transform. GroupByKey requires all key-value pairs, per window, to be held in memory, which can lead to an OutOfMemoryError.
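To make the comparison concrete, here is a minimal sketch of the pattern in question next to its combine-style equivalents. It assumes a recent Scio version (where the pipeline is started with `sc.run()`) and uses a made-up in-memory input of `(userId, playCount)` pairs; the object and value names are purely illustrative.

```scala
import com.spotify.scio._
import com.spotify.scio.values.SCollection

object PreferAggregateSketch {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Hypothetical input: (userId, playCount) pairs.
    val plays: SCollection[(String, Long)] =
      sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    // Discouraged: all values of a key are gathered into an Iterable
    // before anything is summed.
    val viaGroup: SCollection[(String, Long)] =
      plays.groupByKey.mapValues(_.sum)

    // Preferred: values are folded into an accumulator (seqOp) and
    // partial accumulators are merged (combOp).
    val viaAggregate: SCollection[(String, Long)] =
      plays.aggregateByKey(0L)(_ + _, _ + _)

    // Shorthand when the accumulator has the same type as the values.
    val viaReduce: SCollection[(String, Long)] =
      plays.reduceByKey(_ + _)

    sc.run()
  }
}
```

All three produce the same per-key totals; the difference lies in how much data has to be held per key while the result is computed.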

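Aggregation removes the need to hold all values in memory, because values are continuously combined/reduced while the transform executes. The order in which values are combined/reduced is non-deterministic, which is why every combine/reduce operation must be associative and commutative. Scio's aggregateByKey implementation uses Beam's primitive Combine transform.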
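The ordering requirement can be illustrated without a pipeline at all. The sketch below is plain Scala, not Scio or Beam code: `combineInBundles` is a hypothetical helper that models a runner reducing each bundle of values separately and then merging the partial results, and the two bundlings are invented for illustration.

```scala
object CombineOrderDemo {
  // Models a runner that reduces each bundle of values separately and then
  // merges the partial results. Bundle contents and order are not guaranteed.
  def combineInBundles(bundles: Seq[Seq[Int]])(op: (Int, Int) => Int): Int =
    bundles.map(_.reduce(op)).reduce(op)

  // The same four values for one key, split into bundles in two different ways.
  val bundlingA: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))
  val bundlingB: Seq[Seq[Int]] = Seq(Seq(4, 1), Seq(3, 2))

  def main(args: Array[String]): Unit = {
    // Addition is associative and commutative: both bundlings agree.
    println(combineInBundles(bundlingA)(_ + _)) // 10
    println(combineInBundles(bundlingB)(_ + _)) // 10

    // Subtraction is neither: the result depends on how values were bundled.
    println(combineInBundles(bundlingA)(_ - _)) // 0
    println(combineInBundles(bundlingB)(_ - _)) // 2
  }
}
```

With an operation like subtraction, the same input could yield different results from run to run, which is why the guideline restricts reduce operations to associative and commutative ones.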

References:
1. Scio groupByKey
2. Scio aggregateByKey
3. Apache Beam GroupByKey
4. Apache Beam Combine
5. Google Cloud Dataflow Combine


scala

dataflow

apache-beam

spotify-scio