PySpark: Delta table as a stream source, how to do it?

Question

I am having a problem calling readStream on a Delta table.

What I expected, based on the following link https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-stream-source, for example:

spark.readStream.format("delta").table("events")  # As expected, should work fine

The problem: I tried the following:

df.write.format("delta").saveAsTable("deltatable")  # Saved the DataFrame as a Delta table

spark.readStream.format("delta").table("deltatable")  # Called readStream

Error:

Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'DataStreamReader' object has no attribute 'table'

Note: I am running this on localhost from the PyCharm IDE, with the latest version of pyspark installed. Spark version = 2.4.5, Scala version 2.11.12.

Answer 1

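Try the Delta Lake 0.7.0 release, which supports registering your tables with the Hive metastore. As mentioned in the comments, most Delta Lake examples used folder paths because metastore support was not integrated before this release.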

Also note that for the open-source version of Delta Lake it is best to follow the documentation at https://docs.delta.io/latest/index.html
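As a rough illustration of the setup this answer points to, here is a minimal sketch of a Spark session configured for Delta Lake 0.7.0 with catalog support. The helper names (delta_session_conf, build_delta_session) and the app name are made up for this example. The package coordinates assume Spark 3.0 with Scala 2.12, which Delta Lake 0.7.0 targets, so the Spark 2.4.5 install from the question would need upgrading first.

```python
# Assumed Maven coordinates for Delta Lake 0.7.0 (Spark 3.0 / Scala 2.12 build).
DELTA_PACKAGE = "io.delta:delta-core_2.12:0.7.0"


def delta_session_conf():
    """Return the Spark settings that enable Delta Lake's SQL and catalog support."""
    return {
        # Download the Delta Lake jar when the session starts.
        "spark.jars.packages": DELTA_PACKAGE,
        # Register Delta's SQL extensions.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Route the default catalog through Delta so tables can be registered
        # in the metastore.
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_delta_session(app_name="delta-metastore-demo"):
    """Create (or reuse) a SparkSession with the settings above applied."""
    # Imported lazily so the configuration helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_session_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With such a session, df.write.format("delta").saveAsTable("deltatable") should register the table in the metastore; whether the streaming reader can then address it by name still depends on the Spark version in use.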

Answer 2

The DataStreamReader.table and DataStreamWriter.table methods are not yet in Apache Spark. For now you need to use a Databricks notebook in order to call them.
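One possible workaround outside Databricks is to read the Delta table by its folder path instead of by name. The sketch below is an assumption-laden illustration, not an official recipe: read_delta_stream is a hypothetical helper, and the path in the usage note is a placeholder. It uses the table method only when the reader object actually has one, and otherwise falls back to load(path).

```python
def read_delta_stream(spark, table_name=None, path=None):
    """Open a Delta table as a streaming source, by name if possible, else by path.

    `spark` is expected to be a SparkSession with Delta Lake on its classpath.
    """
    reader = spark.readStream.format("delta")

    # Use the name-based API only when this runtime provides it; on the
    # Spark 2.4.5 setup from the question, the attribute is missing.
    if table_name is not None and hasattr(reader, "table"):
        return reader.table(table_name)

    if path is None:
        raise ValueError(
            "This runtime has no DataStreamReader.table; pass the table's folder path."
        )

    # Path-based read: point at the folder the Delta table was written to.
    return reader.load(path)
```

For this to work the table has to be written to a known location first, for example df.write.format("delta").save("/tmp/delta/events"), after which read_delta_stream(spark, path="/tmp/delta/events") returns a streaming DataFrame. The /tmp/delta/events path is only a placeholder.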

apache-spark

pyspark

databricks

delta-lake