Spark "basePath" option setting

Question

When I do this:

allf = spark.read.parquet("gs://bucket/folder/*")

I get:

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:

...followed, after the list of paths, by this message:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

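I am new to Spark. I believe my data source is actually a collection of "folders" (something like base/top_folder/year=x/month=y/*.parquet), and I would like to load all the files and transform them.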

Thanks for your help!

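Update 1: I looked at the Dataproc console, and there is no way to set "options" when creating a cluster.
Update 2: I checked the cluster's "cluster.properties" file and there is no such option. Do I have to add one and restart the cluster?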

Answer 1

According to the Spark documentation on Parquet partition discovery, I believe that changing your load statement from

allf = spark.read.parquet("gs://bucket/folder/*")

to

allf = spark.read.parquet("gs://bucket/folder")

should discover and load all the Parquet partitions. This assumes the data was written with "folder" as its base directory.
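
Regarding the two updates: the "basePath" named in the error message is an option on the DataFrame reader, set in your job code, not a Dataproc cluster property, so nothing needs to change on the cluster. If you do want to pass partition directories or a glob rather than the root, a minimal sketch might look like the following. The helper name read_partitioned_parquet and the paths in the usage comment are hypothetical; I am assuming the table root is the directory that directly contains the year=... folders.

```python
def read_partitioned_parquet(spark, base_path, *paths):
    """Read Parquet data while keeping the partition columns.

    spark     -- an existing SparkSession
    base_path -- root directory of the table (assumed: the directory
                 that directly contains the year=... folders)
    paths     -- partition directories or globs below base_path;
                 if none are given, base_path itself is read
    """
    # "basePath" tells partition discovery where the table starts, so
    # year=... and month=... become columns even when only
    # subdirectories are passed to parquet().
    reader = spark.read.option("basePath", base_path)
    return reader.parquet(*(paths or (base_path,)))


# Hypothetical usage, with placeholder paths:
# allf = read_partitioned_parquet(
#     spark,
#     "gs://bucket/folder/top_folder",
#     "gs://bucket/folder/top_folder/year=*/month=*",
# )
```

Without the option, Spark has to infer the root from the paths it is handed, which is where the "Conflicting directory structures" assertion comes from.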

If the directory base/folder actually contains multiple datasets, you will need to load each dataset independently and then union them together.
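
A rough sketch of that "load separately, then union" approach, assuming every dataset has the same columns. The helper name load_and_union and the directory names in the usage comment are made up for illustration; substitute your real top-level folders.

```python
from functools import reduce


def load_and_union(spark, root_dirs):
    """Load each table root on its own, then union the results.

    spark     -- an existing SparkSession
    root_dirs -- list of table root directories (placeholders here)
    """
    if not root_dirs:
        raise ValueError("root_dirs must contain at least one path")
    # One read per root, so partition discovery only ever sees a
    # single directory structure at a time.
    frames = [spark.read.parquet(root) for root in root_dirs]
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)


# Hypothetical usage, with placeholder paths:
# allf = load_and_union(
#     spark,
#     ["gs://bucket/folder/top_folder_a", "gs://bucket/folder/top_folder_b"],
# )
```

If the datasets do not share a schema, the union will fail, and you would need to select a common set of columns from each DataFrame first.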

apache-spark

pyspark

google-cloud-dataproc