How to update an existing entry in ORC streaming sink?

When saving a stream to files using the Apache ORC file format, is there a way to update an existing entry instead of appending it again every time it changes, which effectively leaves multiple copies of the same entry?

import org.apache.spark.sql.streaming.Trigger

incomingStreamDF.writeStream
  .format("orc")
  .option("path", "/mnt/adls/orc")
  .option("checkpointLocation", "/mnt/adls/orc/check")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()

ORC itself seems to support updates, so is there a way to indicate the entry's key in the writeStream options?

tl;dr No (up to and including Spark 2.4).

The only output mode that could give you such functionality is the Update output mode. Since the orc format is a FileFormat, it must always be used with the Append output mode.
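You can see the restriction by trying to start the query with Update mode; a hedged sketch (assuming `incomingStreamDF` is a streaming Dataset, as in the question):

```scala
// Attempting Update output mode with a file sink such as orc fails at
// query start: Spark rejects it with an AnalysisException saying the
// data source does not support Update output mode.
incomingStreamDF.writeStream
  .format("orc")
  .outputMode("update")                           // not supported by file sinks
  .option("path", "/mnt/adls/orc")
  .option("checkpointLocation", "/mnt/adls/orc/check")
  .start()                                        // throws AnalysisException
```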


A solution to the problem could be to use the brand-new DataStreamWriter.foreachBatch operator (or the older DataStreamWriter.foreach), where you can process the data however you want, so you could update entries in ORC files yourself, if you know how to do that.

foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]

Sets the output of the streaming query to be processed using the provided function.

This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous).

The provided function will be called in every micro-batch with:

(i) the output rows as a Dataset

(ii) the batch identifier.

The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems.

The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
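As a minimal sketch of the foreachBatch approach, here is one way an "update by rewrite" could look. The paths, the `key` column name, and `incomingStreamDF` are assumptions carried over from the question, not part of any Spark API; the rewrite strategy (left-anti join plus union, overwriting a separate target directory) is just one possible implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

incomingStreamDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Keep at most one row per key within this micro-batch.
    val deduped = batchDF.dropDuplicates("key")

    // "Update" by rewriting: drop existing rows whose keys appear in
    // this batch, append the fresh rows, and write the result out.
    val spark    = batchDF.sparkSession
    val existing = spark.read.orc("/mnt/adls/orc")
    existing
      .join(deduped.select("key"), Seq("key"), "left_anti")
      .union(deduped)
      .write
      .mode("overwrite")
      // Write to a different location: overwriting the directory you
      // are reading from in the same job is not safe.
      .orc("/mnt/adls/orc_updated")
  }
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()
```

Rewriting the whole dataset every micro-batch is expensive, so this only makes sense for modest data volumes; the point is simply that foreachBatch gives you full batch-API control over what happens to each micro-batch.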