Spark SQL: 为什么Spark不一直广播

Question

我在 aws s3 和 emr 上使用 Spark 2.4 进行一个项目，我有一个左连接，其中有两个很大的数据部分。 spark执行不稳定，内存问题经常失败

集群有10台m3.2xlarge类型的机器，每台机器有16个vCore，30GiB内存，160GB SSD存储。

我有这样的配置：

          "--executor-memory",
          "6512M",
          "--driver-memory",
          "12g",
          "--conf",
          "spark.driver.maxResultSize=4g",
          "--conf",
          "spark.sql.autoBroadcastJoinThreshold=1073741824",

left join发生在左侧150GB和右侧30GB左右之间，所以有很多shuffle。我的解决方案是将右侧削减到足够小，例如 1G，这样数据将被广播而不是随机播放。唯一的问题是在第一个左连接之后，左侧已经有了来自右侧的新列，所以接下来的左连接将有重复的列，比如 col1_right_1、col2_right_1、col1_right_2, col2_right_2 并且我必须将 col1_right_1/col1_right_2 重命名为 col1_left, col2_right_1/col2_right_2 重命名为 col2_left.

所以我想知道，为什么 Spark 允许 shuffle 发生，而不是到处使用广播。广播不应该总是比随机播放快吗？为什么 Spark 不像我说的那样加入，将一侧切成小块并广播？

Answer 1

让我们看看这两个选项。如果我理解正确，您正在为数据帧的每一块执行广播和连接，其中块的大小是最大广播阈值。这里的优点是您基本上只通过网络发送一个数据帧，但您正在执行多个连接。要执行的每个连接都有一个开销。 From:

Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.

这意味着对于每批广播连接，在每个分区中，您必须查看整个其他数据集并执行连接。

Sortmerge 或 hash join 必须执行洗牌（如果数据集不是等分的）但它们的连接效率更高。

Spark SQL: 为什么Spark不一直广播

Spark SQL: why does not Spark do broadcast all the time

apache-spark

pyspark-sql