How to handle this obscure error when doing a join in Spark?

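I am part of a team running a Spark cluster on Databricks. The join is between two entities, one of which is bucketed. Both DataFrames have the same number of partitions and are partitioned/bucketed by the join key.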

I get the following error at runtime:

There should be only one distinct value of the number pre-shuffle partitions among registered Exchange operator

I would appreciate any help in handling it.

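This happens when the two sides of the join have a different number of pre-shuffle partitions (i.e. map output partitions). For example, if one bucketed side has 10 partitions and the other has 20, this error is expected.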

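Spark checks that all Exchange operators feeding a stage report the same number of pre-shuffle partitions, and fails with this error when they do not.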

From a comment in the Spark source code:

The reason that we are expecting a single value of the number of pre-shuffle partitions is that when we add Exchanges, we set the number of pre-shuffle partitions (i.e. map output partitions) using a static setting, which is the value of spark.sql.shuffle.partitions. Even if two input RDDs are having different number of partitions, they will have the same number of pre-shuffle partitions

So you need to make sure that both DataFrames going into the join have the same number of partitions.
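As a rough sketch of that check, the helpers below compare the partition counts of the two sides and, if they differ, repartition the non-bucketed side on the join key to match the bucketed one. The function names `partition_counts` and `align_to_bucketed` are made up for illustration. The only PySpark calls used are `df.rdd.getNumPartitions()` and `df.repartition(numPartitions, *cols)`. This assumes the count reported by `getNumPartitions()` is the one that matters for the error, and it adds a shuffle on the repartitioned side, so treat it as something to try rather than a guaranteed fix.

```python
def partition_counts(bucketed, other):
    """Return the current partition count of each DataFrame as a tuple."""
    return bucketed.rdd.getNumPartitions(), other.rdd.getNumPartitions()


def align_to_bucketed(bucketed, other, key):
    """Repartition `other` on `key` so its partition count matches `bucketed`.

    Both DataFrames are returned unchanged when the counts already agree.
    The bucketed side is never touched, so its bucketing layout is preserved.
    """
    n_bucketed, n_other = partition_counts(bucketed, other)
    if n_bucketed == n_other:
        return bucketed, other
    # repartition(n, col) hash-partitions `other` by the join key into n partitions
    return bucketed, other.repartition(n_bucketed, key)


# Hypothetical usage with real DataFrames (table and column names are assumptions):
#   bucketed = spark.table("events_bucketed")
#   other = spark.table("users")
#   print(partition_counts(bucketed, other))
#   bucketed, other = align_to_bucketed(bucketed, other, "user_id")
#   joined = bucketed.join(other, "user_id")
```

Printing the two counts before the join is a quick way to confirm whether a mismatch is actually present.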

apache-spark

pyspark

databricks