Spark Worker asking for absurd amounts of virtual memory

I'm running a Spark job on a 2-node YARN cluster. My dataset is small (< 100MB), just for testing, but the worker is getting killed because it is asking for far too much virtual memory. The amounts involved are absurd: 2 GB of the 11 GB of physical memory is in use, yet 300 GB of virtual memory is reported as used.

16/02/12 05:49:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.1 (TID 22, ip-172-31-6-141.ec2.internal): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1455246675722_0023_01_000003 on host: ip-172-31-6-141.ec2.internal. Exit status: 143.
Diagnostics: Container [pid=23206,containerID=container_1455246675722_0023_01_000003] is running beyond virtual memory limits. Current usage: 2.1 GB of 11 GB physical memory used; 305.3 GB of 23.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1455246675722_0023_01_000003 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23292 23213 23292 23206 (python) 15 3 101298176 5514 python -m pyspark.daemon
|- 23206 1659 23206 23206 (bash) 0 0 11431936 352 /bin/bash -c /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp '-Dspark.driver.port=37386' -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar 1> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stdout 2> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stderr
|- 23341 23292 23292 23206 (python) 87 8 39464374272 23281 python -m pyspark.daemon
|- 23350 23292 23292 23206 (python) 86 7 39463976960 24680 python -m pyspark.daemon
|- 23329 23292 23292 23206 (python) 90 6 39464521728 23281 python -m pyspark.daemon
|- 23213 23206 23206 23206 (java) 1168 61 11967115264 359820 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp -Dspark.driver.port=37386 -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar
|- 23347 23292 23292 23206 (python) 87 10 39464783872 23393 python -m pyspark.daemon
|- 23335 23292 23292 23206 (python) 83 9 39464112128 23216 python -m pyspark.daemon
|- 23338 23292 23292 23206 (python) 81 9 39463714816 24614 python -m pyspark.daemon
|- 23332 23292 23292 23206 (python) 86 6 39464374272 24812 python -m pyspark.daemon
|- 23344 23292 23292 23206 (python) 85 30 39464374272 23281 python -m pyspark.daemon
Container killed on request. Exit code is 143

Does anyone know why this might be happening? I've been trying to tweak various YARN and Spark configurations, but I know something is badly wrong for it to be requesting this much vmem.
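For context, the Spark-side knobs that govern a container's memory request are the executor heap and the YARN memory overhead; a generic sketch of how they are passed is below (the script name and sizes are placeholders, not the values from my cluster; spark.yarn.executor.memoryOverhead is in MB). On the YARN side, yarn.nodemanager.vmem-pmem-ratio and yarn.nodemanager.vmem-check-enabled in yarn-site.xml control how strictly the NodeManager enforces the virtual-memory limit, which is the check killing the container here.

# Sketch only: your_job.py and the sizes are placeholders.
spark-submit --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your_job.py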

The command I'm running is:

spark-submit --executor-cores 8 ...

It turns out the executor-cores flag doesn't do what I thought it did. It spawns 8 copies of the pyspark.daemon process, running 8 copies of the worker process to run jobs. Each process was using 38 GB of virtual memory, which is unnecessarily large, but 8 * 38 ≈ 300, so that explains it.
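A quick way to confirm this on a worker node is to list the Python processes with their virtual and resident sizes (a sketch using standard procps options; sizes are reported in KiB):

# List every python process with its virtual (VSZ) and resident (RSS) size.
ps -C python -o pid,ppid,vsz,rss,args --forest
# With --executor-cores 8 there are ~8 "python -m pyspark.daemon" workers at
# roughly 38 GB of VSZ each, and 8 * 38 GB ≈ 304 GB, which matches the
# "305.3 GB of 23.1 GB virtual memory used" line in the YARN diagnostic above.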

It is really a very badly named flag. If I set executor-cores to 1, it creates just one daemon, but that daemon will still use multiple cores, as seen in htop.
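For completeness, the re-run that matches this observation looks roughly like the following (a sketch; your_job.py stands in for the elided part of my original command). With a single daemon per executor, the container's virtual memory stays at roughly one daemon's worth (~38 GB here) instead of cores * 38 GB.

# One pyspark.daemon per executor instead of eight.
spark-submit --executor-cores 1 your_job.py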