Spark Structured Streaming - 流式数据与静态数据相结合，每 5 分钟刷新一次

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

对于 spark 结构化流作业，一个输入来自 kafka 主题，而第二个输入是一个文件（将由 python API 每 5 分钟刷新一次）。我需要加入这两个输入并写入一个 kafka 主题。

我面临的问题是当刷新第二个输入文件并且 spark 流作业正在读取文件时，我收到以下错误：

File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.

任何帮助将不胜感激。

使用 HBase 作为静态存储。这肯定是更多的工作，但允许并发更新。

在我工作的地方，所有 Spark Streaming 都使用 HBase 来查找数据。快多了。如果您有 1 亿客户需要 1 万条记录的微批次怎么办？我知道最初的工作量很大。

见https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc

如果你有一个小的静态引用 table，那么静态连接没问题，但你也有更新，导致问题。

Spark Structured Streaming - 流式数据与静态数据相结合，每 5 分钟刷新一次

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

apache-spark

spark-structured-streaming

spark-streaming-kafka