sc.broadcast 和 spark sql 中广播函数的区别
Difference between sc.broadcast and broadcast function in spark sql
我已使用 sc.broadcast
查找文件以提高性能。
我还了解到 Spark SQL 函数中有一个名为 broadcast
的函数。
两者有什么区别?
我应该使用哪一个来广播 reference/look up 表?
如果你想在 Spark SQL 中实现广播连接,你应该使用 broadcast
函数(结合所需的 spark.sql.autoBroadcastJoinThreshold
配置)。它将:
- 标记给定的广播关系。
- 调整SQL执行计划。
- 评估输出关系时,它将负责收集数据、广播和应用正确的连接机制。
SparkContext.broadcast
用于处理本地对象,适用于Spark DataFrames
.
一句话回答:
1) org.apache.spark.sql.functions.broadcast()
函数由用户提供,给定 sql 连接的明确提示。
2) sc.broadcast
用于广播只读共享变量。
有关 broadcast
函数 #1 的更多详细信息:
这是来自
sql/execution/SparkStrategies.scala
这就是说。
- Broadcast: if one side of the join has an estimated physical size that is smaller than the * user-configurable
[[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold * or if that
side has an explicit broadcast hint (e.g. the user applied the *
[[org.apache.spark.sql.functions.broadcast()]] function to a
DataFrame), then that side * of the join will be broadcasted
and the other side will be streamed, with no shuffling *
performed. If both sides of the join are eligible to be broadcasted
then the *
- Shuffle hash join: if the average size of a single
partition is small enough to build a hash * table.
- Sort merge: if the matching join keys are sortable.
- If there is no joining keys, Join implementations are chosen with the following precedence:
- BroadcastNestedLoopJoin: if one side of the join could be broadcasted
- CartesianProduct: for Inner join
- BroadcastNestedLoopJoin
- 下面的方法根据我们设置的大小控制行为
spark.sql.autoBroadcastJoinThreshold
默认为 10mb
Note : smallDataFrame.join(largeDataFrame)
does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame)
does.
/** Matches a plan whose output should be small enough to be used in broadcast join.
**/
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.statistics.isBroadcastable ||
plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
以后below configurations will be deprecated in coming versions of spark。
我已使用 sc.broadcast
查找文件以提高性能。
我还了解到 Spark SQL 函数中有一个名为 broadcast
的函数。
两者有什么区别?
我应该使用哪一个来广播 reference/look up 表?
如果你想在 Spark SQL 中实现广播连接,你应该使用 broadcast
函数(结合所需的 spark.sql.autoBroadcastJoinThreshold
配置)。它将:
- 标记给定的广播关系。
- 调整SQL执行计划。
- 评估输出关系时,它将负责收集数据、广播和应用正确的连接机制。
SparkContext.broadcast
用于处理本地对象,适用于Spark DataFrames
.
一句话回答:
1) org.apache.spark.sql.functions.broadcast()
函数由用户提供,给定 sql 连接的明确提示。
2) sc.broadcast
用于广播只读共享变量。
有关 broadcast
函数 #1 的更多详细信息:
这是来自
sql/execution/SparkStrategies.scala
这就是说。
- Broadcast: if one side of the join has an estimated physical size that is smaller than the * user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold * or if that side has an explicit broadcast hint (e.g. the user applied the *
[[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side * of the join will be broadcasted and the other side will be streamed, with no shuffling *
performed. If both sides of the join are eligible to be broadcasted then the *- Shuffle hash join: if the average size of a single partition is small enough to build a hash * table.
- Sort merge: if the matching join keys are sortable.
- If there is no joining keys, Join implementations are chosen with the following precedence:
- BroadcastNestedLoopJoin: if one side of the join could be broadcasted
- CartesianProduct: for Inner join
- BroadcastNestedLoopJoin
- 下面的方法根据我们设置的大小控制行为
spark.sql.autoBroadcastJoinThreshold
默认为 10mb
Note :
smallDataFrame.join(largeDataFrame)
does not do a broadcast hash join, butlargeDataFrame.join(smallDataFrame)
does.
/** Matches a plan whose output should be small enough to be used in broadcast join.
**/
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.statistics.isBroadcastable ||
plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
以后below configurations will be deprecated in coming versions of spark。