Increase Java Memory on Spark for Building Large Hash Relations
I am currently trying to run TPC-H queries on SnappyData. At first the query gave me an error saying:
ERROR 38000: (SQLState=38000 Severity=-1)
(Server=localhost[1528],Thread[DRDAConnThread_29,5,gemfirexd.daemons])
The exception 'Both sides of this join are outside the broadcasting
threshold and computing it could be prohibitively expensive. To
explicitly enable it, please set spark.sql.crossJoin.enabled = true;'
was thrown while evaluating an expression.
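For reference, one common way to enable this property is a SQL SET command issued in the session, roughly as below (a sketch; the exact mechanism can depend on the client used to connect to SnappyData):
SET spark.sql.crossJoin.enabled=true;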
After enabling spark.sql.crossJoin.enabled and re-running the query, this error message popped up:
java.lang.RuntimeException: Can't acquire 1049600 bytes memory to build hash relation, got 74332 bytes
at org.apache.spark.sql.execution.joins.HashedRelationCache$.get(LocalJoin.scala:621)
at org.apache.spark.sql.execution.joins.HashedRelationCache.get(LocalJoin.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun.apply(WholeStageCodegenExec.scala:367)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun.apply(WholeStageCodegenExec.scala:364)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$$anonfun$apply.apply(RDD.scala:820)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$$anonfun$apply.apply(RDD.scala:820)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Can't acquire 1049600 bytes memory to build hash relation, got 74332 bytes
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:414)
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.init(HashedRelation.scala:424)
at org.apache.spark.sql.ex
Please tell me how to increase the amount of memory available for building the hash relation.
Just in case, here is the query. I am trying to run it on a 1 GB dataset (I did try it on an empty dataset, and it did work).
TPC-H Query 16:
SELECT i_name,
substr(i_data, 1, 3) AS brand,
i_price,
count(DISTINCT (pmod((s_w_id * s_i_id),10000))) AS supplier_cnt
FROM stock,
item
WHERE i_id = s_i_id
AND i_data NOT LIKE 'zz%'
AND (pmod((s_w_id * s_i_id),10000) NOT IN
(SELECT su_suppkey
FROM supplier
WHERE su_comment LIKE '%bad%'))
GROUP BY i_name,
substr(i_data, 1, 3),
i_price
ORDER BY supplier_cnt DESC;
Set the JVM memory to -J-Xms5g in the server configuration file (conf/servers), so that conf/servers looks like:
localhost -locators=localhost:10334 -J-Xms5g
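Note that -Xms only sets the initial heap size; the maximum heap can be raised the same way by passing the standard -Xmx JVM flag through the -J- prefix (a sketch, assuming the host has enough RAM for a 5 GB heap):
localhost -locators=localhost:10334 -J-Xms5g -J-Xmx5g
The server has to be restarted after editing conf/servers for the new heap settings to take effect.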