Can a PySpark Kernel (JupyterHub) run in yarn-client mode?
My current setup:
- Spark EC2 cluster with HDFS and YARN
- JupyterHub (0.7.0)
- PySpark kernel with Python 2.7
The very simple code I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel, which works as expected against Spark Standalone, has the following environment variable in its kernel JSON file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode it hangs forever, and the output in the JupyterHub logs is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I added the HADOOP_CONF_DIR environment variable pointing to the directory where the Hadoop configuration lives, and changed the PYSPARK_SUBMIT_ARGS --master property to "yarn-client". I can also confirm that no other jobs are running during this time and that the workers are correctly registered.
I am under the impression that it is possible to configure a JupyterHub notebook with a PySpark kernel to run on YARN, as other people have done it, so if that is indeed the case, what am I doing wrong?
To get your PySpark kernel working in yarn mode you have to do some additional configuration:
Configure YARN for a remote connection by copying the hadoop-yarn-server-web-proxy-<version>.jar from your YARN cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your Jupyter instance (you need a local Hadoop installation); a copy sketch for this and the next two files follows below.
Copy the hive-site.xml from your cluster into <local spark directory>/spark-<version>/conf/.
Copy the yarn-site.xml from your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/.
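A rough sketch of the three copy steps above, assuming you can scp from a cluster node; the cluster-side source paths are placeholders and will differ per distribution:
# Sketch only: pull the web-proxy jar and the cluster configs onto the Jupyter host
scp <cluster-node>:<cluster hadoop dir>/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar \
    <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/
scp <cluster-node>:<cluster hive conf dir>/hive-site.xml <local spark directory>/spark-<version>/conf/
scp <cluster-node>:<cluster hadoop conf dir>/yarn-site.xml <local hadoop directory>/hadoop-<version>/etc/hadoop/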
Set the environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
Now you can create your kernel in the file /usr/local/share/jupyter/kernels/pyspark/kernel.json:
{
"display_name": "pySpark (Spark 2.1.0)",
"language": "python",
"argv": [
"/opt/conda/envs/python35/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
"SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
"PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
"PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
}
}
Restart your JupyterHub and you should see the pyspark kernel. Note that the root user usually does not have YARN permission because of uid=1; you should connect to JupyterHub as another user.
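For reference, you can check that the new kernel is registered before logging in; the restart command is only an assumption here, since it depends on how JupyterHub is installed:
# List registered kernels; "pyspark" should show up next to the default Python kernels
jupyter kernelspec list
# Restart JupyterHub so it picks up the new kernel (assuming a systemd-managed install)
sudo systemctl restart jupyterhub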
Hope my case can help you.
I configured the master URL by simply passing it as a parameter:
import findspark
findspark.init()  # make the local Spark installation importable
from pyspark import SparkContext
# master URL first, then the application name
sc = SparkContext("yarn-client", "First App")
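If the context still hangs at this point, it can help to first confirm that YARN submission works at all from the same machine, outside Jupyter. A minimal check, assuming a standard Spark distribution under $SPARK_HOME:
# Submit the bundled Pi example to YARN in client mode; it should finish within a minute
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client \
    $SPARK_HOME/examples/src/main/python/pi.py 10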