spark streaming，如何跟踪已处理的源文件？

Question

Spark Streaming 如何跟踪已处理的文件？

问题1：假设一个场景，spark处理了今天的文件(a.csv, b.csv, c.csv)，3天后新文件(d.csv ) 已经到达，spark 怎么知道它必须处理唯一的 d.csv？这里遵循的基本机制是什么？

问题2：作为用户，我想知道文件是否真的被处理过，如何查看？

Answer 1

How does the spark streaming keeps the track of files which have been processed?

负责class的是FileStreamSource。您将在这里找到接下来 2 个问题的答案。

how does spark know it has to process the only d.csv? what is the underlying mechanism followed here?

A CompactibleFileStreamLog 用于根据上次修改时的时间戳维护可见文件的映射。基于这些条目，创建了一个不断增加的偏移量（参考文献 FileStreamSourceOffset）。这个偏移量是跨运行的检查点，就像 Kafka 等其他流媒体源一样。

whether the files have been really processed, how can I check?

从 code 我可以看到你可以通过 DEBUG 日志检查，

 batchFiles.foreach { file =>
  seenFiles.add(file._1, file._2)
  logDebug(s"New file: $file")
}

您可以检查的另一个地方是检查点数据，但由于它包含序列化偏移信息，我怀疑您能否从那里获得任何详细信息。

spark streaming, how to track processed source files?