Apache Flink:水印、删除延迟事件和允许延迟
Apache Flink: Watermarks, Dropping Late Events, and Allowed Lateness
我无法理解水印和允许迟到的概念。
以下是 [邮件存档|https://www.mail-archive.com/user@flink.apache.org/msg08758.html] 中关于水印的摘录,但我还有几个问题。下面是引用的例子:
Assume you have a BoundedOutOfOrdernessTimestampExtractor
with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:
If you have the following stream sequence:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
no allowed lateness
The window operator forwards the logical time to 12:12 when it receives <WM, 12:12>
and evaluates the window which contains [A, B, C, D]
at this time and finally purges its state. <12:09, G>
is later ignored.
allowed lateness of 3 minutes
The window operator evaluates the window when <WM, 12:12>
is received, but its state is not purged yet. The state is purged when <WM, 12:14>
is received (window fire time 12:10 + 3mins allowed lateness). <12:09, G>
is again ignored.
allowed lateness of 5 minutes
The window operator evaluates the window when <WM, 12:12>
is received, but its state is not purged yet. When <12:09, G>
is received, the window is again evaluated but this time with [A, B, C, D, G]
and an update is sent out. The state is purged when a watermark of >= 12:15 is received.
据我了解:
- 水印应该告诉任何到达的事件时间戳小于水印的元素都将被丢弃。因此 12:02 的水印意味着 Flink 在事件时间 12:02 之前已经看到了它必须看到的所有内容。事件时间戳小于此水印的任何元素,例如12:01 将被丢弃。
- 允许迟到的概念仅适用于标记 window
结束的最后一个水印之后
我的问题基于理解:
- How is how is message "12:02,C" being entited consider Flink, with the previous watermark (WM, 12:02), 已经说了 "I have seen everything till活动时间12:02”?
- 我已经调整了流序列并插入了另一条记录 12:01,CCC 在流序列中如下面以粗体显示的点。
If you have the following stream sequence:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:01, CCC // Inserted by Sheel
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
这仍然在 12:00-12:10 window 但在水印 WM 之后,12:02。可以说允许的迟到是 5 分钟。此记录是否会被接受 "somehow" 将允许的迟到纳入画面,或者考虑到水印 12:02 已经超过,是否会被丢弃?
Watermarks
控制 window 的生命周期,但不直接控制记录是否被删除。当 Flink 的 WindowOperator
接收到一条新记录时,它会计算它落入的 windows 的集合。如果这个集合至少包含一个活动的window,也就是说没有比window的结束时间+允许迟到时间更高的水印,记录将分配给这个window 并将成为 window 计算的一部分(即使记录的时间戳低于上次看到的水印)。因此,可以说 windows 降低了与单个记录相关的水印分辨率。
在您的情况下,这意味着 C
和 CCC
都将成为 window 12:00 - 12:10
的一部分,因为系统还没有看到 Watermark
与 >= 12:10
, 然而.
我无法理解水印和允许迟到的概念。
以下是 [邮件存档|https://www.mail-archive.com/user@flink.apache.org/msg08758.html] 中关于水印的摘录,但我还有几个问题。下面是引用的例子:
Assume you have a
BoundedOutOfOrdernessTimestampExtractor
with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:If you have the following stream sequence:
12:01, A 12:04, B WM, 12:02 // 12:04 - 2 minutes 12:02, C 12:08, D 12:14, E WM, 12:12 12:16, F WM, 12:14 // 12:16 - 2 minutes 12:09, G
no allowed lateness
The window operator forwards the logical time to 12:12 when it receives
<WM, 12:12>
and evaluates the window which contains[A, B, C, D]
at this time and finally purges its state.<12:09, G>
is later ignored.allowed lateness of 3 minutes
The window operator evaluates the window when
<WM, 12:12>
is received, but its state is not purged yet. The state is purged when<WM, 12:14>
is received (window fire time 12:10 + 3mins allowed lateness).<12:09, G>
is again ignored.allowed lateness of 5 minutes
The window operator evaluates the window when
<WM, 12:12>
is received, but its state is not purged yet. When<12:09, G>
is received, the window is again evaluated but this time with[A, B, C, D, G]
and an update is sent out. The state is purged when a watermark of >= 12:15 is received.
据我了解:
- 水印应该告诉任何到达的事件时间戳小于水印的元素都将被丢弃。因此 12:02 的水印意味着 Flink 在事件时间 12:02 之前已经看到了它必须看到的所有内容。事件时间戳小于此水印的任何元素,例如12:01 将被丢弃。
- 允许迟到的概念仅适用于标记 window 结束的最后一个水印之后
我的问题基于理解:
- How is how is message "12:02,C" being entited consider Flink, with the previous watermark (WM, 12:02), 已经说了 "I have seen everything till活动时间12:02”?
- 我已经调整了流序列并插入了另一条记录 12:01,CCC 在流序列中如下面以粗体显示的点。
If you have the following stream sequence:
12:01, A 12:04, B WM, 12:02 // 12:04 - 2 minutes 12:02, C 12:01, CCC // Inserted by Sheel 12:08, D 12:14, E WM, 12:12 12:16, F WM, 12:14 // 12:16 - 2 minutes 12:09, G
这仍然在 12:00-12:10 window 但在水印 WM 之后,12:02。可以说允许的迟到是 5 分钟。此记录是否会被接受 "somehow" 将允许的迟到纳入画面,或者考虑到水印 12:02 已经超过,是否会被丢弃?
Watermarks
控制 window 的生命周期,但不直接控制记录是否被删除。当 Flink 的 WindowOperator
接收到一条新记录时,它会计算它落入的 windows 的集合。如果这个集合至少包含一个活动的window,也就是说没有比window的结束时间+允许迟到时间更高的水印,记录将分配给这个window 并将成为 window 计算的一部分(即使记录的时间戳低于上次看到的水印)。因此,可以说 windows 降低了与单个记录相关的水印分辨率。
在您的情况下,这意味着 C
和 CCC
都将成为 window 12:00 - 12:10
的一部分,因为系统还没有看到 Watermark
与 >= 12:10
, 然而.