Spark Streaming 状态存储在哪里？

Where is Spark Streamings state stored?

我正在使用带有 updateStateByKey() 和 mapWithState() 函数的 Spark Streaming，但我不清楚状态保存在哪里。状态是否保存在 HDFS 上？或者它是内存存储？如何保证容错性？

状态存储在 Streaming Context 启动期间指定的检查点目录中。有关检查点及其提供的容错的更多信息，请参阅 documentation

Spark Streaming 目前有两种有状态流的实现。一个是较旧的 PairRDDFunctions.updateStateByKey (Spark <= 1.5.0) ，它使用 CoGroupedRDD to store the state for each key. The newer version called PairRDDFunctions.mapWithState (Spark >= 1.6.0) uses a OpenHashMapBasedStateMap[K, V] 来存储内部状态。这两个都是内存实现

这两个有状态流都使用 checkpointing 作为 持久性 容错机制。检查点位置可以是 HDFS 或 Amazon 的 S3，其中数据在每个时间间隔内持久保存，该时间间隔由用户使用 DStream.checkpoint 定义或默认为（批次间隔 * 常量）。使用有状态流时，您有义务指定检查点目录。

Spark Streaming 状态存储在哪里？

Where is Spark Streamings state stored?

apache-spark

spark-streaming