Is there a way to tell before the write how many files will be created when saving a Spark Dataframe as a Delta Table in Azure Data Lake Storage Gen1?

Question

I am currently trying to save a Spark Dataframe to Azure Data Lake Storage (ADLS) Gen1. While doing so, I receive the following throttling error:

org.apache.spark.SparkException: Job aborted. Caused by: com.microsoft.azure.datalake.store.ADLException: Error creating file /user/DEGI/CLCPM_DATA/fraud_project/policy_risk_motorcar_with_lookups/part-00000-34d88646-3755-488d-af00-ef2e201240c8-c000.snappy.parquet
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]

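I read in the documentation that throttling occurs when the CREATE limit is hit, which causes the job to abort. The documentation also gives three possible reasons why this can happen:

Your application creates a large number of small files.
External applications create a large number of files.
The current limit for the subscription is too low.

I don't think my subscription limit is too low, but it may be that my application is creating too many parquet files. Does anyone know how to tell how many files will be created when saving as a table? And how can I find out the maximum number of files I am allowed to create?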

The code I use to create the table looks like this:

df.write.format("delta").mode("overwrite").saveAsTable("database_name.df", path ='adl://my path to storage')

Also, I was able to write a smaller test dataframe without any problems. In addition, the folder permissions in ADLS are set correctly.

Answer 1

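The error you are getting does not look like a problem with the number of files. HTTP 401 is an authorization (Unauthorized) error. Nonetheless:

Spark writes at least as many files as there are partitions, so what you want to do is repartition your dataframe. There are several repartitioning APIs; to reduce the number of partitions while limiting data movement, coalesce() is recommended: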

df.coalesce(10).write....
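
To answer the "how many files" part before writing, one option is to look at the partition count with df.rdd.getNumPartitions() and coalesce only when it is above a chosen cap. The sketch below is illustrative rather than definitive: the helper names target_partition_count and limit_output_files, and the 128 MB target file size, are assumptions made here and do not come from the question. It also assumes an unpartitioned write (no partitionBy and no maxRecordsPerFile), where the partition count is a reasonable estimate of the number of data files.

```python
import math


def target_partition_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Rough number of output files so that each is about target_file_bytes.

    total_bytes is an estimate supplied by the caller; the 128 MB default
    is an assumed target, not an ADLS or Delta requirement.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def limit_output_files(df, max_files):
    """Return df with at most max_files partitions.

    df is expected to be a Spark DataFrame. The current partition count is
    used as an estimate of how many data files an unpartitioned write
    would produce.
    """
    current = df.rdd.getNumPartitions()
    if current > max_files:
        # coalesce merges partitions without a full shuffle
        return df.coalesce(max_files)
    return df
```

With a real DataFrame this could be used as limit_output_files(df, 10).write.format("delta")..., after first checking df.rdd.getNumPartitions() to see how many files the write would otherwise produce.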


python

azure

apache-spark

azure-data-lake

delta-lake