What is the maximum size for a broadcast object in Spark?
When using the DataFrame broadcast function or the SparkContext broadcast function, what is the maximum object size that can be dispatched to all executors?
broadcast function:
The default is 10 MB, but we have used it up to 300 MB; it is controlled by spark.sql.autoBroadcastJoinThreshold.
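For example, a minimal sketch of raising the threshold, assuming an existing SparkSession named spark:

    // Raise the auto-broadcast threshold to 300 MB.
    // Setting it to -1 disables automatic broadcast joins entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 300L * 1024 * 1024)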
AFAIK, it all depends on the memory available, so there is no definite answer. What I would say is that it should be smaller than the large DataFrame, and you can estimate the size of a large or small DataFrame like this:
    import org.apache.spark.util.SizeEstimator

    // Returns the estimated in-memory size of the object, in bytes.
    // (The original snippet used logInfo, which requires mixing in
    // org.apache.spark.internal.Logging; println works anywhere.)
    println(SizeEstimator.estimate(yourLargeOrSmallDataFrameHere))
Based on this, you can pass the broadcast hint to the framework.
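A minimal sketch of the hint, assuming two hypothetical DataFrames largeDf and smallDf that share an "id" column:

    import org.apache.spark.sql.functions.broadcast

    // Force smallDf to be broadcast regardless of the threshold.
    val joined = largeDf.join(broadcast(smallDf), Seq("id"))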
Also have a look at the Scala doc comments in sql/execution/SparkStrategies.scala, which say:
- Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
- Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
- Sort merge: if the matching join keys are sortable.
- If there is no joining keys, Join implementations are chosen with the following precedence:
  - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
  - CartesianProduct: for Inner join
  - BroadcastNestedLoopJoin
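You can check which of these strategies the planner actually chose via explain(); a sketch, with hypothetical DataFrames ordersDf and customersDf:

    // Look for BroadcastHashJoin vs. SortMergeJoin in the printed physical plan.
    ordersDf.join(customersDf, Seq("customerId")).explain()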
Also have a look at other-configuration-options.
SparkContext.broadcast (TorrentBroadcast):
There is also a property for broadcast shared variables: spark.broadcast.blockSize=4m.
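A minimal sketch of a torrent broadcast, assuming a hypothetical lookup map (the blockSize shown is simply the default):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("broadcast-demo")
      .config("spark.broadcast.blockSize", "4m") // chunk size used by TorrentBroadcast
      .getOrCreate()

    // Ship a read-only lookup table to every executor once.
    val lookup = spark.sparkContext.broadcast(Map(1 -> "a", 2 -> "b"))
    val labels = spark.sparkContext.parallelize(Seq(1, 2, 3))
      .map(id => lookup.value.getOrElse(id, "?"))
    println(labels.collect().mkString(", "))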
AFAIK, I don't see a hard limit there either...
For more information, see TorrentBroadcast.scala.
EDIT:
However, you may run into the 2 GB issue, even though it is not officially declared in the docs (I could not find anything of the kind there). Please see SPARK-6235, which is in "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.
As of Spark 2.4, the upper limit is 8 GB. Source Code
UPDATE: The 8 GB limit still holds for Spark 3.2.1. Source Code
As mentioned above, the upper limit is 8 GB. However, when you have multiple files to broadcast, Spark pushes all the data files to the driver. The driver merges those files and pushes the result to the executor nodes. In this process, if the driver's available memory is smaller than the combined broadcast files, you will get an out-of-memory error.
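A hedged sketch of that pattern (the paths, the "id" column, and the DataFrame factsDf are illustrative); the usual mitigation is to give the driver enough heap at submit time, e.g. spark-submit --driver-memory 8g:

    import org.apache.spark.sql.functions.broadcast

    // Union several small lookup files; the combined data is materialized
    // on the driver before being broadcast, so it must fit in the driver heap.
    val lookups = Seq("/data/lookup1.parquet", "/data/lookup2.parquet")
      .map(spark.read.parquet(_))
      .reduce(_ union _)

    val result = factsDf.join(broadcast(lookups), Seq("id"))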