Apache Flink:水印、删除延迟事件和允许延迟

Apache Flink: Watermarks, Dropping Late Events, and Allowed Lateness

我无法理解水印和允许迟到的概念。

以下是 [邮件存档|https://www.mail-archive.com/user@flink.apache.org/msg08758.html] 中关于水印的摘录,但我还有几个问题。下面是引用的例子:

Assume you have a BoundedOutOfOrdernessTimestampExtractor with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:

If you have the following stream sequence:

12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G

no allowed lateness

The window operator forwards the logical time to 12:12 when it receives <WM, 12:12> and evaluates the window which contains [A, B, C, D] at this time and finally purges its state. <12:09, G> is later ignored.

allowed lateness of 3 minutes

The window operator evaluates the window when <WM, 12:12> is received, but its state is not purged yet. The state is purged when <WM, 12:14> is received (window fire time 12:10 + 3mins allowed lateness). <12:09, G> is again ignored.

allowed lateness of 5 minutes

The window operator evaluates the window when <WM, 12:12> is received, but its state is not purged yet. When <12:09, G> is received, the window is again evaluated but this time with [A, B, C, D, G] and an update is sent out. The state is purged when a watermark of >= 12:15 is received.

据我了解:

  1. 水印应该告诉任何到达的事件时间戳小于水印的元素都将被丢弃。因此 12:02 的水印意味着 Flink 在事件时间 12:02 之前已经看到了它必须看到的所有内容。事件时间戳小于此水印的任何元素,例如12:01 将被丢弃。
  2. 允许迟到的概念仅适用于标记 window
  3. 结束的最后一个水印之后

我的问题基于理解:

  1. How is how is message "12:02,C" being entited consider Flink, with the previous watermark (WM, 12:02), 已经说了 "I have seen everything till活动时间12:02”?
  2. 我已经调整了流序列并插入了另一条记录 12:01,CCC 在流序列中如下面以粗体显示的点。

If you have the following stream sequence:

12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
 12:01, CCC // Inserted by Sheel
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G

这仍然在 12:00-12:10 window 但在水印 WM 之后,12:02。可以说允许的迟到是 5 分钟。此记录是否会被接受 "somehow" 将允许的迟到纳入画面,或者考虑到水印 12:02 已经超过,是否会被丢弃?

Watermarks 控制 window 的生命周期,但不直接控制记录是否被删除。当 Flink 的 WindowOperator 接收到一条新记录时,它会计算它落入的 windows 的集合。如果这个集合至少包含一个活动的window,也就是说没有比window的结束时间+允许迟到时间更高的水印,记录将分配给这个window 并将成为 window 计算的一部分(即使记录的时间戳低于上次看到的水印)。因此,可以说 windows 降低了与单个记录相关的水印分辨率。

在您的情况下,这意味着 CCCC 都将成为 window 12:00 - 12:10 的一部分,因为系统还没有看到 Watermark 与 >= 12:10, 然而.