Spark readStream does not pick up schema changes in the input files. How to fix it?

Question

I have the following script that uses Spark Structured Streaming to read CDC data before merging it into a base Delta table.

streamDf = spark \
    .readStream \
    .format('csv') \
    .option("mergeSchema", "true") \
    .option('header', 'true') \
    .option("path", CDCLoadPath) \
    .load()

streamQuery = (streamDf \
               .writeStream \
               .format("delta") \
               .outputMode("append") \
               .foreachBatch(mergetoDelta) \
               .option("checkpointLocation", f"{CheckpointLoc}/_checkpoint") \
               .trigger(processingTime='20 seconds') \
               .start())

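Whenever I add a new column to the source table, the read stream does not pick up the schema change from the source files, even though the underlying data contains the new column. If I restart the script manually, however, it picks up the schema with the new column. Is there a way for the stream to pick this up while it is running?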

Answer 1

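Either you supply the input schema explicitly, or you have to restart the query so that schema inference runs again. A streaming file source fixes its schema when the streaming DataFrame is created and does not re-infer it while the query is running.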
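For the first option, here is a minimal sketch of passing the schema to the reader instead of relying on inference. The column names and types in `CDC_COLUMNS` are hypothetical placeholders, and `to_ddl` and `read_cdc_stream` are helper names invented for illustration. `DataStreamReader.schema` accepts either a `StructType` or a DDL string. The sketch uses the DDL form.

```python
# Hypothetical column layout for the CDC files; replace with the real one.
CDC_COLUMNS = [
    ("id", "BIGINT"),
    ("op", "STRING"),
    ("updated_at", "TIMESTAMP"),
]


def to_ddl(columns):
    """Render (name, type) pairs as a DDL schema string."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


def read_cdc_stream(spark, path, columns=CDC_COLUMNS):
    """Build the streaming DataFrame with an explicit schema (no inference)."""
    return (
        spark.readStream
        .format("csv")
        .option("header", "true")
        .schema(to_ddl(columns))  # schema is fixed here, when the DataFrame is created
        .load(path)
    )
```

Usage would look like `streamDf = read_cdc_stream(spark, CDCLoadPath)`. Note that an explicit schema makes the stream predictable, but a column that is not listed in it is still not read. Picking up a new column means updating the column list and restarting the query.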

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets
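The restart route can also be automated from the driver. The following is only a sketch under stated assumptions: the CSV files are reachable through the local filesystem (for cloud storage the listing would need the storage client instead of `glob`), and `start_query` is a hypothetical zero-argument function that rebuilds the `readStream`/`writeStream` pipeline from the question and returns the `StreamingQuery`. It compares the header of the newest file with the last known header and restarts the query when they differ.

```python
import csv
import glob
import os


def latest_header(directory):
    """Return the header row of the most recently modified CSV file, or None."""
    files = glob.glob(os.path.join(directory, "*.csv"))
    if not files:
        return None
    newest = max(files, key=os.path.getmtime)
    with open(newest, newline="") as handle:
        return next(csv.reader(handle), None)


def run_with_restarts(start_query, directory, poll_seconds=60):
    """Run the stream and restart it whenever the newest file's header changes."""
    known = latest_header(directory)
    query = start_query()
    while True:
        # awaitTermination(timeout) returns True once the query has stopped.
        if query.awaitTermination(poll_seconds):
            return
        current = latest_header(directory)
        if current is not None and current != known:
            query.stop()           # progress is kept in the checkpoint location
            known = current
            query = start_query()  # load() runs again, so the schema is re-inferred
```

Because the restarted query reuses the same checkpoint location, it continues from where the stopped one left off, which matches the manual restart described in the question.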

apache-spark

spark-streaming

pyspark

spark-structured-streaming