Spark Streaming 是否可以基于事件时间进行窗口化?
Is windowing based on event time possible with Spark Streaming?
根据数据流模型论文:一种在大规模、无界、无序数据处理中平衡正确性、延迟和成本的实用方法:
MillWheel and Spark Streaming are both sufficiently scalable,
fault-tolerant, and low-latency to act as reasonable substrates, but
lack high-level programming models that make calculating event-time
sessions straightforward.
总是这样吗?
不,不是。
引用 https://dzone.com/articles/spark-streaming-vs-structured-streaming 以节省我的午餐时间!:
One big issue in the streaming world is how to process data according
to event-time.
Event-time is the time when the event actually happened. It is not
necessary for the source of the streaming engine to prove data in
real-time. There may be latencies in data generation and handing over
the data to the processing engine. There is no such option in Spark
Streaming to work on the data using the event-time. It only works with
the timestamp when the data is received by the Spark. Based on the
ingestion timestamp, Spark Streaming puts the data in a batch even if
the event is generated early and belonged to the earlier batch, which
may result in less accurate information as it is equal to the data
loss.
On the other hand, Structured Streaming provides the functionality to
process data on the basis of event-time when the timestamp of the
event is included in the data received. This is a major feature
introduced in Structured Streaming which provides a different way of
processing the data according to the time of data generation in the
real world. With this, we can handle data coming in late and get more
accurate results.
With event-time handling of late data, Structured Streaming outweighs
Spark Streaming.
根据数据流模型论文:一种在大规模、无界、无序数据处理中平衡正确性、延迟和成本的实用方法:
MillWheel and Spark Streaming are both sufficiently scalable, fault-tolerant, and low-latency to act as reasonable substrates, but lack high-level programming models that make calculating event-time sessions straightforward.
总是这样吗?
不,不是。
引用 https://dzone.com/articles/spark-streaming-vs-structured-streaming 以节省我的午餐时间!:
One big issue in the streaming world is how to process data according to event-time.
Event-time is the time when the event actually happened. It is not necessary for the source of the streaming engine to prove data in real-time. There may be latencies in data generation and handing over the data to the processing engine. There is no such option in Spark Streaming to work on the data using the event-time. It only works with the timestamp when the data is received by the Spark. Based on the ingestion timestamp, Spark Streaming puts the data in a batch even if the event is generated early and belonged to the earlier batch, which may result in less accurate information as it is equal to the data loss.
On the other hand, Structured Streaming provides the functionality to process data on the basis of event-time when the timestamp of the event is included in the data received. This is a major feature introduced in Structured Streaming which provides a different way of processing the data according to the time of data generation in the real world. With this, we can handle data coming in late and get more accurate results.
With event-time handling of late data, Structured Streaming outweighs Spark Streaming.