Difference between sc.broadcast and the broadcast function in Spark SQL

I have used sc.broadcast for a lookup file to improve performance.

I also came to know that there is a function called broadcast among the Spark SQL functions.

What is the difference between the two?

Which one should I use for broadcasting a reference/lookup table?

If you want to achieve a broadcast join in Spark SQL, you should use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will do the following (a usage sketch follows the list):

  • Mark the given relation for broadcasting.
  • Adjust the SQL execution plan.
  • When the output relation is evaluated, take care of collecting the data, broadcasting it, and applying the correct join mechanism.
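
For concreteness, here is a minimal sketch of that usage; the table paths and the join column name (customer_id) are made-up assumptions for illustration, not taken from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-hint-sketch").getOrCreate()

    // Hypothetical inputs: orders is large, customers is a small reference table.
    val orders    = spark.read.parquet("/data/orders")
    val customers = spark.read.parquet("/data/customers")

    // The broadcast() hint marks customers for broadcasting, so the planner
    // chooses a broadcast hash join regardless of the estimated size.
    val joined = orders.join(broadcast(customers), Seq("customer_id"))

    joined.explain()  // the physical plan should show BroadcastHashJoin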

SparkContext.broadcast is used to handle local objects and is applicable for use with Spark DataFrames.
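
By contrast, here is a minimal sketch of sc.broadcast used as a lookup for a DataFrame column via a UDF; the map contents, paths, and column names are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("sc-broadcast-sketch").getOrCreate()

    // Small, read-only lookup data that fits comfortably in driver memory.
    val countryNames = Map("US" -> "United States", "DE" -> "Germany")

    // One read-only copy is shipped to each executor instead of one copy per task.
    val countryNamesBc = spark.sparkContext.broadcast(countryNames)

    val resolveCountry = udf((code: String) => countryNamesBc.value.getOrElse(code, "unknown"))

    val users = spark.read.parquet("/data/users")  // hypothetical input with a country_code column
    val resolved = users.withColumn("country_name", resolveCountry(col("country_code")))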

One-line answers:

1) The org.apache.spark.sql.functions.broadcast() function is a user-supplied, explicit hint for a given SQL join.

2) sc.broadcast is used to broadcast a read-only shared variable.


More details on the broadcast function (#1):

This is from sql/execution/SparkStrategies.scala.

It says the following:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the …
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.
  • If there is no joining keys, Join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
    • CartesianProduct: for Inner join
    • BroadcastNestedLoopJoin
The method below controls this behaviour depending on the size we set via spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB:

Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.

    /** Matches a plan whose output should be small enough to be used in broadcast join. */
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.statistics.isBroadcastable ||
        plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }
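
To see how the threshold and the hint interact, a small experiment like the following can help; the largeDataFrame/smallDataFrame setup is an assumption here, echoing the names used in the note above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("threshold-sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the note's largeDataFrame / smallDataFrame.
    val largeDataFrame = (1 to 1000000).toDF("key")
    val smallDataFrame = (1 to 100).toDF("key")

    // Default is 10 MB; "-1" disables size-based (automatic) broadcasting entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // With automatic broadcasting off, only the explicit hint still triggers it.
    largeDataFrame.join(broadcast(smallDataFrame), Seq("key")).explain()  // BroadcastHashJoin
    largeDataFrame.join(smallDataFrame, Seq("key")).explain()             // typically SortMergeJoin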

Note: these configurations may be deprecated in coming versions of Spark.