How to stream an hdfs location for all files and write to another hdfs location simultaneously

Question

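I have around 20K JSON files in Parquet format in one HDFS location. My job is to stream that location, read all the files into a DataFrame, and then write the same data to another HDFS location.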

Can someone suggest how I should do this? I am using the Azure Databricks platform and PySpark for this task.

Answer 1

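I'm not sure whether you want to do this in a "streaming" way or a "batch" way. Either way, you can use the streaming functionality and trigger the job once, so that it processes all the files currently available and then stops.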

    (
        spark
        .readStream  # Read the data as a stream
        .schema(USER_SCHEMA)  # For streaming, you must provide the input schema of the data
        .format("parquet")
        .load(PARQUET_ORIGIN_LOCATION)
        .writeStream
        .format("delta")
        .option("path", PARQUET_DESTINATION_LOCATION + 'data/')  # Where to store the data
        .option("checkpointLocation", PARQUET_DESTINATION_LOCATION + 'checkpoint/')  # The checkpoint location
        .option("overwriteSchema", True)  # Allows the schema to be overwritten
        .queryName(QUERY_NAME)  # Name of the query
        .trigger(once=True)  # Process all available data once, then stop (batch-like)
        .start()
    )
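
If a plain "batch" copy is enough, or you need a value for `USER_SCHEMA`, something along these lines might work. This is only a minimal sketch: the helper names `infer_parquet_schema` and `copy_parquet_batch` are made up for illustration, `spark` is assumed to be the session that a Databricks notebook already provides, and the default write mode is an assumption you may want to change. Unlike the snippet above, it writes plain Parquet, not Delta.

```python
def infer_parquet_schema(spark, source_path):
    """Return the schema Spark reads from the Parquet files under source_path."""
    # A batch read picks the schema up from the Parquet files themselves,
    # so it can be reused as the explicit schema a streaming read requires.
    return spark.read.parquet(source_path).schema


def copy_parquet_batch(spark, source_path, destination_path, mode="errorifexists"):
    """Read all Parquet files under source_path and rewrite them under destination_path."""
    df = spark.read.parquet(source_path)
    # The mode is an assumption: "errorifexists" refuses to write into an
    # existing path; "append" and "overwrite" are the other common choices.
    df.write.mode(mode).parquet(destination_path)
    return df.schema
```

For example, `USER_SCHEMA = infer_parquet_schema(spark, PARQUET_ORIGIN_LOCATION)` could supply the schema for the streaming read, while `copy_parquet_batch(spark, PARQUET_ORIGIN_LOCATION, PARQUET_DESTINATION_LOCATION + 'data/')` would do the whole copy in one batch job without a checkpoint.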

hdfs

pyspark

azure-data-lake

azure-databricks