为什么完整输出模式需要聚合？

Question

我在 Apache Spark 2.2 中使用最新的 Structured Streaming 并遇到以下异常：

org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;

为什么完整输出模式需要流式聚合？如果 Spark 在流式查询中允许没有聚合的完整输出模式，会发生什么情况？

scala> spark.version
res0: String = 2.2.0

import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
  withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
  dropDuplicates("id").
  withColumn("time", $"time" cast "long")  // <-- convert time column back from Timestamp to Int

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
scala> val q = ids.
     |   writeStream.
     |   format("memory").
     |   queryName("dups").
     |   outputMode(OutputMode.Complete).  // <-- memory sink supports checkpointing for Complete output mode only
     |   trigger(Trigger.ProcessingTime(30.seconds)).
     |   option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
     |   start
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
Project [cast(time#10 as bigint) AS time#15L, id#6]
+- Deduplicate [id#6], true
   +- Project [cast(time#5 as timestamp) AS time#10, id#6]
      +- Project [_1#2 AS time#5, _2#3 AS id#6]
         +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]

  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
  at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
  at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:247)
  ... 57 elided

Answer 1

来自 Structured Streaming Programming Guide - 其他查询（不包括聚合，mapGroupsWithState 和 flatMapGroupsWithState）：

Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.

回答问题：

What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?

可能 OOM。

令人费解的部分是为什么 dropDuplicates("id") 没有被标记为聚合。

Answer 2

我觉得是输出模式的问题。不要使用 OutputMode.Complete，而是使用 OutputMode.Append，如下所示。

scala> val q = ids
    .writeStream
    .format("memory")
    .queryName("dups")
    .outputMode(OutputMode.Append)
    .trigger(Trigger.ProcessingTime(30.seconds))
    .option("checkpointLocation", "checkpoint-dir")
    .start

为什么完整输出模式需要聚合？

Why does Complete output mode require aggregation?

apache-spark

spark-structured-streaming