Does Flink reduce on the fly in batch mode?
According to the Flink streaming documentation:
The window function can be one of ReduceFunction, FoldFunction or WindowFunction. The first two can be executed more efficiently (see State Size section) because Flink can incrementally aggregate the elements for each window as they arrive.
Is the same true in batch mode? In the example below, I'm reading ~36 GB of data from Cassandra, but I expect the reduced output to be much smaller (~0.5 GB). Does running this job require Flink to keep the entire input in memory, or is it smart enough to iterate over it?
DataSet<MyRecord> input = ...;
DataSet<MyRecord> sampled = input
    .groupBy(MyRecord::getSampleKey)
    .reduce(MyRecord::keepLast);
Looking at the documentation on the Reduce Operation in Flink, I see the following:
A Reduce transformation that is applied on a grouped DataSet reduces each group to a single element using a user-defined reduce function. For each group of input elements, a reduce function successively combines pairs of elements into one element until only a single element for each group remains.

Note that for a ReduceFunction the keyed fields of the returned object should match the input values. This is because reduce is implicitly combinable and objects emitted from the combine operator are again grouped by key when passed to the reduce operator.
If I'm reading this correctly, Flink performs the reduce operation on the mapper side and then again on the reducer side, so the data that is actually emitted/serialized should be small.
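To make the combinable behavior concrete, here is a minimal, self-contained sketch of the job from the question. MyRecord, getSampleKey, and keepLast are hypothetical stand-ins, since the original types aren't shown; the in-memory source stands in for the Cassandra input. The key point is that keepLast returns an object whose key field matches its inputs, which is what lets Flink apply the same function as a pre-shuffle combiner.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SampleJob {

    // Hypothetical stand-in for the MyRecord type from the question:
    // a valid Flink POJO (public fields, no-arg constructor).
    public static class MyRecord {
        public String sampleKey;
        public long timestamp;

        public MyRecord() {}

        public MyRecord(String sampleKey, long timestamp) {
            this.sampleKey = sampleKey;
            this.timestamp = timestamp;
        }

        public String getSampleKey() {
            return sampleKey;
        }

        // Keeps the record with the larger timestamp. The returned object
        // carries the same key as its inputs, so the reduce is implicitly
        // combinable and Flink can run it before the shuffle as well.
        public static MyRecord keepLast(MyRecord a, MyRecord b) {
            return b.timestamp >= a.timestamp ? b : a;
        }

        @Override
        public String toString() {
            return sampleKey + "@" + timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Small in-memory input standing in for the Cassandra source.
        DataSet<MyRecord> input = env.fromElements(
                new MyRecord("a", 1), new MyRecord("a", 2),
                new MyRecord("b", 5), new MyRecord("b", 3));

        // The ReduceFunction is applied twice: as a combiner on the input
        // side, producing partial results per key, and again after the
        // records are grouped by key. Only the partially reduced records
        // are serialized and sent over the network.
        DataSet<MyRecord> sampled = input
                .groupBy(MyRecord::getSampleKey)
                .reduce(MyRecord::keepLast);

        sampled.print(); // expected: a@2 and b@5, in some order
    }
}

With this shape, the ~36 GB input never needs to sit in memory as a whole: each parallel instance only has to hold partially combined records per key before shipping them to the final reduce.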