了解 Flink 中 Operator 之间的数据传输（批处理）

Understanding data transferring between Operators in Flink (Batch)

我仍在为如何在不同运算符之间 flink“exchanges/transffers”数据以及运算符之间的实际数据发生什么而苦苦思索。

以上面的DAG为例： DAG of execution

DataSet 得到 forwarded/transferred 到 GroupReduce 运算符的所有并行实例，数据根据 GroupReduce 转换得到减少。
All 的新数据被转发到 Filter->Map->Map 操作数，即 GroupReduce 的并行实例之一消耗的所有数据operator 被转移到 Filter->Map->Map operator 的一个实例（不需要 serialization/deserialization，因此 Operator 访问 GroupReduce Operator 生成的数据）
All 在 (Filter->Map) Operator ( serialization/deserialization 运算符之间需要）

因此，例如，如果 GroupReduce 运算符的输出约为 100MB，它会将 100MB 转发到 (Filter->Map->Map) 操作数并散列该 100MB 的副本并将其传输到 (Filter->地图）实例。所以我将生成另外 100MB 的网络流量

我很困惑为什么在 GroupReduce 之后和过滤步骤之前有这么多网络流量。在将现在减少的数据发送给后续操作员之前，将 GroupRedcue 和 Filter 步骤链接在一起不是更好吗？

GroupReduce 功能与使用 MapReduce 编程模型中的组合器相同。

Partial computation can significantly improve the performance of a GroupReduceFunction. This technique is also known as applying a Combiner. Implement the GroupCombineFunction interface to enable partial computations, i.e., a combiner for this GroupReduceFunction.

因此，在组合器之后总是有一个洗牌 phase/partition 将所有上游运算符连接到所有下游运算符。检查 this answer 以阐明什么是组合器。

了解 Flink 中 Operator 之间的数据传输（批处理）

Understanding data transferring between Operators in Flink (Batch)

apache-flink

flink-batch