spark.sql.crossJoin.enabled for Spark 2.x

Question

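I am using the 'preview' Google DataProc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a Cartesian product. Since version 2.0.0 there is a Spark configuration parameter (spark.sql.crossJoin.enabled) that prohibits Cartesian products, and an exception is thrown. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?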

Answer 1

Spark >= 3.0

spark.sql.crossJoin.enabled is true by default (SPARK-28621).

Spark >= 2.1

You can use crossJoin:

df1.crossJoin(df2)

It makes your intent explicit and keeps the more conservative configuration in place, which protects you from unintended cross joins.

Spark 2.0

SQL properties can be set dynamically at runtime with the RuntimeConfig.set method, so you should be able to call

spark.conf.set("spark.sql.crossJoin.enabled", true)

whenever you want to explicitly allow a Cartesian product.
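If you only want the setting switched on for one specific operation, a rough PySpark sketch might look like the following. The helper name cross_join_enabled and the constant CROSS_JOIN_KEY are made up for illustration; the only Spark calls used are spark.conf.get and spark.conf.set on an existing SparkSession, and the fallback of "false" when the key is unset is an assumption matching the Spark 2.x default.

```python
from contextlib import contextmanager

# The property discussed in this thread.
CROSS_JOIN_KEY = "spark.sql.crossJoin.enabled"


@contextmanager
def cross_join_enabled(spark):
    """Hypothetical helper: allow Cartesian products only inside a `with` block.

    `spark` is expected to be an existing SparkSession.
    """
    # Remember the current value; assume "false" (the 2.x default) if unset.
    previous = spark.conf.get(CROSS_JOIN_KEY, "false")
    # A string value is used so it does not depend on how booleans are converted.
    spark.conf.set(CROSS_JOIN_KEY, "true")
    try:
        yield spark
    finally:
        # Restore the earlier value so later joins are protected again.
        spark.conf.set(CROSS_JOIN_KEY, previous)


# Usage, assuming `spark`, `df1` and `df2` already exist:
#
#     with cross_join_enabled(spark):
#         pairs = df1.join(df2)   # join without a condition
#         pairs.count()           # run the action inside the block
```

Note that the action has to run inside the block, since the check happens when the query is planned, not when the DataFrame is defined.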

Answer 2

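To change the default value of a configuration setting in Dataproc, you don't even need an init action. You can use the --properties flag when creating the cluster from the command line: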

gcloud dataproc clusters create --properties spark:spark.sql.crossJoin.enabled=true my-cluster ...
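If you create clusters from a script, the same command line can be assembled in Python. This is only a sketch: the function name dataproc_create_command is hypothetical, it only builds the argument list (it does not call gcloud), and it assumes the spark: prefix and comma-separated key=value format shown in the command above.

```python
import shlex


def dataproc_create_command(cluster_name, spark_properties):
    """Hypothetical helper: build the gcloud argument list for cluster creation."""
    # Each Spark property gets the "spark:" prefix, as in the command above.
    props = ",".join(
        f"spark:{key}={value}" for key, value in sorted(spark_properties.items())
    )
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--properties", props,
    ]


cmd = dataproc_create_command(
    "my-cluster", {"spark.sql.crossJoin.enabled": "true"}
)
# Print the command as a single shell-safe line for inspection.
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once you have added whatever other flags your cluster needs.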

Answer 3

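The TPCDS benchmark query set includes queries that contain cross joins. Unless you explicitly write CROSS JOIN or dynamically set the Spark property to true with spark.conf.set("spark.sql.crossJoin.enabled", true), you will run into an exception.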

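The error appears on TPCDS queries 28, 61, 88, and 90 because the original query syntax from the Transaction Processing Performance Council (TPC) separates the joined relations with commas, and Spark's default join operation is an inner join. My team also decided to use CROSS JOIN rather than change Spark's default property.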
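To make the difference concrete, here is a small PySpark-style sketch of the two query shapes. The table names a and b and the function name run_explicit_cross_join are placeholders, not taken from the TPCDS queries themselves, and the sketch assumes a Spark version whose SQL parser accepts the explicit CROSS JOIN keyword.

```python
# Comma-separated relations with no join condition: Spark 2.x treats this as
# an inner join and rejects it unless spark.sql.crossJoin.enabled is true.
IMPLICIT_QUERY = "SELECT * FROM a, b"

# The same result with the intent spelled out; this form does not require
# changing the configuration.
EXPLICIT_QUERY = "SELECT * FROM a CROSS JOIN b"


def run_explicit_cross_join(spark):
    """Hypothetical helper: run the explicit form on an existing SparkSession.

    Assumes tables or temp views named `a` and `b` are already registered.
    """
    return spark.sql(EXPLICIT_QUERY)
```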

Answer 4

In PySpark, I think it should be

spark.conf.set("spark.sql.crossJoin.enabled", True)

Otherwise you get

NameError: name 'true' is not defined

apache-spark

google-cloud-dataproc