pyspark 出现 AM 容器限制错误
pyspark erroring with a AM Container limit error
全部,
我们在 AKS (SQLServer 2019 BDC) 上有一个 Apache Spark v3.12 + Yarn。我们 运行 重构 python Pyspark 代码导致以下错误:
Application application_1635264473597_0181 failed 1 times (global
limit =2; local limit is =1) due to AM Container for
appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container
[pid=12990,containerID=container_1635264473597_0181_01_000001] is
running 7282688B beyond the 'PHYSICAL' memory limit. Current usage:
2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES)
FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753
/opt/bin/python3
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538
/opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1
-Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp
-Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001
org.apache.spark.deploy.yarn.ApplicationMaster --class
org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c
/opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1
-Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp
-Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001
org.apache.spark.deploy.yarn.ApplicationMaster --class
'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
1>
/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout
2>
/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code
143.
For more detailed output, check the application tracking page:
https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
默认设置如下,没有运行时间设置:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
AM Container是指Application Master Container还是Application Manager (of YARN)。如果是这种情况,那么在集群模式设置中,Driver 和 Application Master 运行 在同一个 Container 中?
我要更改什么运行时间参数才能成功生成 Pyspark 代码。
谢谢,
grajee
您可能没有更改任何设置 143 可能意味着很多事情,包括您 运行 内存不足。测试你是否 运行 内存不足。我会减少您使用的数据量,看看您的代码是否开始工作。如果是这样,您可能 运行 内存不足,应该考虑重构您的代码。一般来说,我建议在更改 spark 配置之前先尝试更改代码。
为了理解 spark driver 如何在 yarn 上工作,这里有一个合理的解释:https://sujithjay.com/spark/with-yarn
全部,
我们在 AKS (SQLServer 2019 BDC) 上有一个 Apache Spark v3.12 + Yarn。我们 运行 重构 python Pyspark 代码导致以下错误:
Application application_1635264473597_0181 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container [pid=12990,containerID=container_1635264473597_0181_01_000001] is running 7282688B beyond the 'PHYSICAL' memory limit. Current usage: 2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753 /opt/bin/python3 /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538 /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties 1> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout 2> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
默认设置如下,没有运行时间设置:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
AM Container是指Application Master Container还是Application Manager (of YARN)。如果是这种情况,那么在集群模式设置中,Driver 和 Application Master 运行 在同一个 Container 中?
我要更改什么运行时间参数才能成功生成 Pyspark 代码。
谢谢,
grajee
您可能没有更改任何设置 143 可能意味着很多事情,包括您 运行 内存不足。测试你是否 运行 内存不足。我会减少您使用的数据量,看看您的代码是否开始工作。如果是这样,您可能 运行 内存不足,应该考虑重构您的代码。一般来说,我建议在更改 spark 配置之前先尝试更改代码。
为了理解 spark driver 如何在 yarn 上工作,这里有一个合理的解释:https://sujithjay.com/spark/with-yarn