yarn in docker - __spark_libs__.zip does not exist

I have looked through the Stack Overflow posts on this, but they did not help me much.

I am trying to get YARN running on an existing cluster. So far we have been using the Spark standalone manager as our resource allocator, and it has been working as expected.

Here is a basic overview of our architecture; everything in the white boxes runs inside a Docker container.

From the YARN resource manager container on master-machine I can run the following command and get a Spark shell running against YARN:

./pyspark --master yarn --driver-memory 1G --executor-memory 1G --executor-cores 1 --conf "spark.yarn.am.memory=1G"

However, if I try to run the same command from the jupyter container on the client-machine, I get the following error in the YARN UI:

Application application_1512999329660_0001 failed 2 times due to AM 
Container for appattempt_1512999329660_0001_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://master-machine:5000/proxy/application_1512999329660_0001/Then, click on links to logs of each attempt.
Diagnostics: File file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/__spark_libs__5915104925224729874.zip does not exist
java.io.FileNotFoundException: File file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/__spark_libs__5915104925224729874.zip does not exist

I can find file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/ on the client-machine, but I cannot find spark-58732bb2-f513-4aff-b1f0-27f0a8d79947 on the master machine.
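
My understanding of why that path shows up (pieced together while debugging, so treat it as a hypothesis): when the submitting process does not pick up the cluster's fs.defaultFS, Spark's YARN client treats the local filesystem as the default one, so the __spark_libs__*.zip it builds under spark.local.dir (/sparktmp here) is registered with a file:/ URI instead of being uploaded to HDFS, and the node managers on the other machines can never see it. A rough way to check this from inside the jupyter container, assuming the Hadoop CLI is on the PATH there (these exact commands are my own sketch, not part of the setup shown below):

hdfs getconf -confKey fs.defaultFS        # want hdfs://master-machine:54310, not file:///
ls /sparktmp/spark-*/__spark_libs__*.zip  # the archive the Spark client builds locally
hdfs dfs -ls /user/$USER/.sparkStaging    # where a correctly configured client uploads it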

Note that spark-shell does work from the client-machine when it is pointed at the standalone Spark manager on the master machine.
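
For reference, the standalone invocation that does work from the client-machine looks roughly like this (reconstructed from the SPARK_MASTER_PORT=7077 setting further down, so treat the exact flags as an assumption):

./pyspark --master spark://master-machine:7077 --driver-memory 1G --executor-memory 1G --executor-cores 1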

Also, no logs are written to the YARN log directories on the worker machines.

If I run a spark-submit on spark/examples/src/main/python/pi.py, I get the same error as above.
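
For completeness, that spark-submit test looks roughly like the following (again a reconstruction rather than a copy from my shell history; SPARK_HOME is /usr/spark as set in the script below):

./spark-submit --master yarn --deploy-mode client \
    --driver-memory 1G --executor-memory 1G --executor-cores 1 \
    /usr/spark/examples/src/main/python/pi.py 10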

Here is the yarn-site.xml:

<configuration>
  <property>
    <description>YARN hostname</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-machine</value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
  </property>

  <property>
    <description>The address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:5000</value>
  </property>

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8031</value>
  </property>

  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>

  <property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
  </property>

  <property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:8033</value>
  </property>

  <property>
    <description>Set to false, to avoid ip check</description>
    <name>hadoop.security.token.service.use_ip</name>
    <value>false</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1000</value>
    <description>Maximum number of applications in the system which
      can be concurrently active both running and pending</description>
  </property>

  <property>
    <description>Whether to use preemption. Note that preemption is experimental
      in the current version. Defaults to false.</description>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>

  <property>
    <description>Whether to allow multiple container assignments in one
      heartbeat. Defaults to false.</description>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

</configuration>

Here is the spark.conf:

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# DRIVER PROPERTIES
spark.driver.port 7011
spark.fileserver.port 7021
spark.broadcast.port 7031
spark.replClassServer.port 7041
spark.akka.threads 6
spark.driver.cores 4
spark.driver.memory 32g
spark.master yarn
spark.deploy.mode client

# DRIVER AND EXECUTORS
spark.blockManager.port 7051

# EXECUTORS
spark.executor.port 7101

# GENERAL
spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory
spark.port.maxRetries 10
spark.local.dir /sparktmp
spark.scheduler.mode  FAIR

# SPARK UI
spark.ui.port 4140

# DYNAMIC ALLOCATION AND SHUFFLE SERVICE
# http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7061
spark.dynamicAllocation.initialExecutors 5
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.executorIdleTimeout 60s

# LOGGING
spark.executor.logs.rolling.maxRetainedFiles 5
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 100000000

# JMX
# Testing
# spark.driver.extraJavaOptions -Dcom.sun.management.jmxremote.port=8897 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

# Spark Yarn Configs
spark.hadoop.yarn.resourcemanager.address <master-machine IP>:8032
spark.hadoop.yarn.resourcemanager.hostname master-machine

And this shell script is run on all the machines:

# The main ones
export CONDA_DIR=/cluster/conda
export HADOOP_HOME=/usr/hadoop
export SPARK_HOME=/usr/spark
export JAVA_HOME=/usr/java/latest

export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$CONDA_DIR/bin:/cluster/libs-python:/cluster/batch
export PYTHONPATH=/cluster/libs-python:$SPARK_HOME/python:$PY4JPATH:$PYTHONPATH
export SPARK_CLASSPATH=/cluster/libs-java/*:/cluster/libs-python:$SPARK_CLASSPATH

# Core spark configuration
export PYSPARK_PYTHON="/cluster/conda/bin/python"
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_WEBUI_PORT=7081
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Duser.timezone=UTC+02:00"
export SPARK_WORKER_DIR="/sparktmp"
export SPARK_WORKER_CORES=22
export SPARK_WORKER_MEMORY=43G
export SPARK_DAEMON_MEMORY=1G
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_EXECUTOR_CORES=2
export SPARK_LOCAL_IP=$(hostname -I | cut -f1 -d " ")
export SPARK_PUBLIC_DNS=$(hostname -I | cut -f1 -d " ")
export SPARK_MASTER_OPTS="-Duser.timezone=UTC+02:00"

Here is the hdfs-site.xml on the master-machine (namenode):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hdfs</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/hdfs/name</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
   <property>
      <name>dfs.replication.max</name>
      <value>3</value>
   </property>
   <property>
      <name>dfs.replication.min</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permissions.superusergroup</name>
      <value>supergroup</value>
   </property>

   <property>
     <name>dfs.blocksize</name>
     <value>268435456</value>
   </property>

   <property>
     <name>dfs.permissions.enabled</name>
     <value>true</value>
   </property>

   <property>
     <name>fs.permissions.umask-mode</name>
     <value>002</value>
   </property>

  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>

  <property>
  <!-- 1000Mbit/s -->
    <name>dfs.balance.bandwidthPerSec</name>
    <value>125000000</value>
  </property>

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/cluster/config/hadoopconf/namenode/dfs.hosts.exclude</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    <value>10</value>
  </property>

  <property>
    <name>dfs.namenode.replication.max-streams</name>
    <value>50</value>
  </property>

  <property>
    <name>dfs.namenode.replication.max-streams-hard-limit</name>
    <value>100</value>
  </property>

</configuration>

Here is the hdfs-site.xml on the worker machines (datanodes):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hdfs,/hdfs2,/hdfs3</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/hdfs/name</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
   <property>
      <name>dfs.replication.max</name>
      <value>3</value>
   </property>
   <property>
      <name>dfs.replication.min</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permissions.superusergroup</name>
      <value>supergroup</value>
   </property>

   <property>
     <name>dfs.blocksize</name>
     <value>268435456</value>
   </property>

   <property>
     <name>dfs.permissions.enabled</name>
     <value>true</value>
   </property>

   <property>
     <name>fs.permissions.umask-mode</name>
     <value>002</value>
   </property>

   <property>
   <!-- 1000Mbit/s -->
     <name>dfs.balance.bandwidthPerSec</name>
     <value>125000000</value>
   </property>
</configuration>

Here is the core-site.xml on the worker machines (datanodes):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-machine:54310/</value>
  </property>
</configuration>

And here is the core-site.xml on the master machine (namenode):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-machine:54310/</value>
  </property>
</configuration>

After a lot of debugging I found that, for some reason, the jupyter container was not looking in the correct Hadoop conf directory, even though the HADOOP_HOME environment variable pointed to the right location. All I had to do to resolve the problem above was point HADOOP_CONF_DIR at the correct directory, and everything started working again.
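
Concretely, the fix amounts to something like the following in the jupyter container's environment (the conf path is a placeholder for wherever your client-side core-site.xml / yarn-site.xml actually live):

# Make sure the process that launches pyspark/spark-submit can see the cluster's client-side config.
# Spark also honours YARN_CONF_DIR; either variable works as long as it points at the right directory.
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop   # placeholder path; adjust to your layout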