Can a PySpark Kernel (JupyterHub) run in yarn-client mode?

My current setup:

The very simple code I am using for this question:

rdd = sc.parallelize([1, 2])
rdd.collect()

The PySpark kernel, which works as expected on Spark Standalone, has the following environment variable in its kernel json file:

"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"

However, when I try to run in yarn-client mode it gets stuck forever, and the log output from the JupyterHub logs is:

16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

As described here, I have added the HADOOP_CONF_DIR environment variable to point at the directory where the Hadoop configuration lives, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". I can also confirm that no other jobs are running during this time and that the workers are correctly registered.
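For reference, a minimal sanity check one could run in a notebook cell (a sketch only, assuming the kernel's pyspark-shell startup has already created `sc`):

# Sketch of a sanity check, assuming pyspark-shell already created `sc`.
print(sc.master)         # "yarn-client" here, or "spark://<spark_master>:7077" in standalone mode
print(sc.applicationId)  # the YARN application id once the context has registered with the ResourceManager

# This tiny job hangs with the same "Initial job has not accepted any
# resources" warning whenever YARN never grants executors to the application.
print(sc.parallelize(range(10), 2).sum())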

My impression was that it is possible to configure a JupyterHub notebook with a PySpark kernel to run on YARN, as other people have done it. If that is indeed the case, what am I doing wrong?

To get your pyspark working in yarn mode you have to do some additional configuration:

  1. Configure yarn for the remote connection by copying the hadoop-yarn-server-web-proxy-<version>.jar of your yarn cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your jupyter instance (you need a local hadoop)

  2. Copy the hive-site.xml of your cluster to <local spark directory>/spark-<version>/conf/

  3. Copy the yarn-site.xml of your cluster to <local hadoop directory>/hadoop-<version>/etc/hadoop/

  4. Set the environment variables:

    • export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
    • export SPARK_HOME=<local spark directory>/spark-<version>
    • export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
    • export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
  5. Now you can create your kernel in the file /usr/local/share/jupyter/kernels/pyspark/kernel.json:
     {
       "display_name": "pySpark (Spark 2.1.0)",
       "language": "python",
       "argv": [
         "/opt/conda/envs/python35/bin/python",
         "-m",
         "ipykernel",
         "-f",
         "{connection_file}"
       ],
       "env": {
         "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
         "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
         "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
         "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
         "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
       }
     }
    
  6. Restart your jupyterhub and you should see pyspark. Note that the root user usually does not have yarn permission because of uid=1, so you should connect to jupyterhub as another user. A quick sanity check for the new kernel is sketched after this list.
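To verify that the new kernel really talks to YARN, here is a minimal sketch of a check to run in a notebook cell. It assumes the pyspark-shell startup configured above created the usual `spark` and `sc` objects; versions and paths are just the example values from the kernel.json above.

# Quick verification from a notebook cell, assuming the pyspark-shell startup
# in the kernel.json above created `spark` (SparkSession) and `sc`.
print(spark.version)   # e.g. "2.1.0"
print(sc.master)       # "yarn" when PYSPARK_SUBMIT_ARGS is "--master yarn pyspark-shell"

# A tiny job that must be scheduled on YARN executors; if YARN never grants
# resources it hangs with the same "Initial job has not accepted any
# resources" warning as in the question.
print(spark.range(10).selectExpr("sum(id)").collect())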

Hope my case can help you.

I configured it by simply passing the master url as a parameter:

import findspark
findspark.init()  # locate the local Spark installation
from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")
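Note that in Spark 2.x the "yarn-client" master string is no longer the recommended form; the equivalent is master "yarn" with a client deploy mode. A rough sketch of that variant (the builder options and findspark usage here are illustrative assumptions, adjust to your cluster):

import findspark
findspark.init()  # locate the local Spark installation

from pyspark.sql import SparkSession

# Sketch only: master "yarn" plus deployMode "client" replaces the old
# "yarn-client" master string; requires HADOOP_CONF_DIR/YARN_CONF_DIR to be set.
spark = (SparkSession.builder
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .appName("First App")
         .getOrCreate())
sc = spark.sparkContext
print(sc.parallelize([1, 2]).collect())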