Is windowing based on event time possible with Spark Streaming?

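According to the paper The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing: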

MillWheel and Spark Streaming are both sufficiently scalable, fault-tolerant, and low-latency to act as reasonable substrates, but lack high-level programming models that make calculating event-time sessions straightforward.

Is this always the case?

No, it is not.

Quoting https://dzone.com/articles/spark-streaming-vs-structured-streaming to save my lunch time:

One big issue in the streaming world is how to process data according to event-time.

Event-time is the time when the event actually happened. It is not necessary for the source of the streaming engine to prove data in real-time. There may be latencies in data generation and handing over the data to the processing engine. There is no such option in Spark Streaming to work on the data using the event-time. It only works with the timestamp when the data is received by the Spark. Based on the ingestion timestamp, Spark Streaming puts the data in a batch even if the event is generated early and belonged to the earlier batch, which may result in less accurate information as it is equal to the data loss.

On the other hand, Structured Streaming provides the functionality to process data on the basis of event-time when the timestamp of the event is included in the data received. This is a major feature introduced in Structured Streaming which provides a different way of processing the data according to the time of data generation in the real world. With this, we can handle data coming in late and get more accurate results.

With event-time handling of late data, Structured Streaming outweighs Spark Streaming.
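To make the difference described in the quote concrete, here is a minimal plain-Python sketch (not Spark code) that counts the same events per 10-second tumbling window, once keyed by arrival time and once by event time. The event data, the field names `event_time` and `arrival_time`, and the helpers `window_start` and `count_by` are all made up for illustration. In Structured Streaming, grouping of this kind is typically expressed with the `window` function on an event-time column.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical tumbling-window size, in seconds


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)


def count_by(events, key):
    """Count events per tumbling window, using the timestamp stored under key."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[key])] += 1
    return dict(counts)


# Made-up events; both timestamps are in seconds.
events = [
    {"event_time": 1, "arrival_time": 2},
    {"event_time": 4, "arrival_time": 5},
    {"event_time": 8, "arrival_time": 13},   # happened in [0, 10) but arrived late
    {"event_time": 12, "arrival_time": 14},
]

# Arrival-time grouping, analogous to batching on the ingestion timestamp.
by_arrival = count_by(events, "arrival_time")
# Event-time grouping, which puts the late event back into its true window.
by_event = count_by(events, "event_time")

print(by_arrival)  # {0: 2, 10: 2}
print(by_event)    # {0: 3, 10: 1}
```

The late event is counted in the second window when grouping by arrival time, but in the first window when grouping by event time, which is the inaccuracy the quoted article attributes to batching on the ingestion timestamp.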

streaming

dataflow

apache-spark

spark-streaming