是否可以在 Apache Beam 中的 windows 之间传输状态

Question

根据 Apache Beam 文档： https://beam.apache.org/documentation/programming-guide/#state-and-timers

All state for a key is scoped to the current window. This means that the first time a key is seen for a given window any state reads will return empty, and that a runner can garbage collect state when a window is completed.

我有一个用例，当前 window 的输出需要被下一个 window 访问。

windows 固定 5 秒 windows 并执行计算，输出此 window 内覆盖的总距离。我需要将此距离添加到下一个 windows 总数中。目前，我通过将总数写入数据库并在下一个 window 中读取来实现这一点，但这会大大减慢处理速度。

所以我的问题是，状态是否可以在windows之间传输。或者我是否必须在总体全局 window 内设置一个具有 5 秒 windows 的全局 window？这可能吗？

Answer 1

据我所知，无法在 windows 之间传输状态，如果有某种形式的解决方法，我不会推荐它，因为它违背了 Beam 模型并且可能会产生意外行为。

然而，在坚持 Beam 模型的同时获得您想要的结果应该不会太困难。具体细节可能有点取决于您打算如何使用求和距离，但第一步保持不变：将 window 切换回全局 window，因此我们可以聚合在所有元素上。

Java 示例：

PCollection<Integer> distances = ... // The windowed distances.
PCollection<Integer> globalDistances =
        distances.apply(Window.<Integer>into(new GlobalWindows()));

既然距离是全局的 window，下面的转换取决于您想要的结果。

选项 1：结合触发器。一种简单的方法是使用像编程指南中的 Sum to add up all the distances, while adding a repeating trigger to the window to repeatedly fire early so results are regularly updated. This is mainly a solution for Streaming pipelines, as early firing triggers have no effect in Batch pipelines. See the Triggers section 这样的结合以了解详细信息。

Java 示例：

// Window fires a pane on every 10 elements received.
PCollection<Integer> globalDistances =
        distances.apply(Window.<Integer>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10))));
PCollection<Int> summed = globalDistances.apply(Sum.integersGlobally())

方案二：Sum via State另一个方案更接近你的初衷。使用状态编写自定义 DoFn，例如 CombiningState，对接收到的每个元素求和并输出当前总和。这会将求和的每个步骤输出为一个单独的元素，而不必处理早期触发窗格或类似的东西。如果您明确需要求和的每个步骤都有一个单独的元素，或者如果您有 Batch 管道并且不能使用触发器，那么这是最佳选择。

是否可以在 Apache Beam 中的 windows 之间传输状态

Is it possible to transfer state between windows in Apache Beam

apache-beam