Jupyter notebook is 3x slower when using spark-submit yarn-client instead of spark-submit local

Question

I am working with a Hadoop (Cloudera) data lake:

Spark 1.6.0
Python 3.5.2
IPython 4.1.1 (the notebook server is 4.1.0)

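I am executing exactly the same three pieces of code in each environment:

Count the number of entries in a Hive table with 4M entries
Count the number of entries in a Hive table with 70M entries
Complex SQL query with many joins on Hive tables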
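
For anyone reproducing these measurements, here is a minimal sketch of a wall-time helper that can wrap each of the three jobs. The name `time_call` is hypothetical (it is not part of Spark or Jupyter), and the workload below is a stand-in so the snippet runs without a cluster.

```python
import time

def time_call(fn, repeats=1):
    """Run fn() `repeats` times; return (last_result, wall_times_in_seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        timings.append(time.perf_counter() - start)
    return result, timings

# Stand-in workload so the sketch runs anywhere.
# In the notebook the callable would be something like:
#   lambda: sqlContext.sql("SELECT COUNT(*) FROM some_db.some_table").collect()
# where `some_db.some_table` is a placeholder, not a table from the question.
result, timings = time_call(lambda: sum(range(1000000)), repeats=3)
print(result, ["%.4f s" % t for t in timings])
```

Repeating each job a few times helps separate one-off startup cost (executor allocation, metastore connection) from steady-state run time.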

I am trying to understand the difference between:

Running on the Hadoop cluster ("spark-submit --master yarn-cluster" using Oozie)
Running on the edge node of the Hadoop cluster ("spark-submit --master local")
Using Jupyter notebook ("spark-submit --master yarn-client")
Using Jupyter notebook ("spark-submit --master local")

Please find the test results below:

-------------------------------------------------------------------------------------------------
| test                                 | wall time code 1 | wall time code 2 | wall time code 3 |
-------------------------------------------------------------------------------------------------
| Oozie (spark-submit yarn-cluster)    |  14.50 s         |  10.69 s         |  85.74 s         |
-------------------------------------------------------------------------------------------------
| edge node (spark-submit yarn-client) |  12.93 s         |   8.91 s         | 122.12 s         |
-------------------------------------------------------------------------------------------------
| edge node (spark-submit local)       |   5.15 s         |  19.05 s         | 414.68 s         |
-------------------------------------------------------------------------------------------------
| Jupyter (spark-submit yarn-client)   |  15.30 s         | 145.77 s         | 986.71 s         |
-------------------------------------------------------------------------------------------------
| Jupyter (spark-submit local)         |   5.89 s         |  21.46 s         | 385.66 s         |
-------------------------------------------------------------------------------------------------

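It makes sense to me that results are computed much faster on the YARN cluster than on the edge node, which only uses the API to access the data on the cluster. What I do not understand is why, with Jupyter Notebook, I see a factor of about 2.5 between:

Using Jupyter notebook ("spark-submit --master yarn-client")
Using Jupyter notebook ("spark-submit --master local")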
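
As a quick check of that factor, the ratios can be recomputed directly from the table above. This is only arithmetic on the reported wall times; the dictionary keys and the helper name `slowdown` are my own labels.

```python
# Wall times in seconds copied from the table: (code 1, code 2, code 3).
wall_times = {
    "Oozie (yarn-cluster)": (14.50, 10.69, 85.74),
    "edge node (yarn-client)": (12.93, 8.91, 122.12),
    "edge node (local)": (5.15, 19.05, 414.68),
    "Jupyter (yarn-client)": (15.30, 145.77, 986.71),
    "Jupyter (local)": (5.89, 21.46, 385.66),
}

def slowdown(slow, fast):
    """Per-code ratio of wall times between two rows of the table."""
    return [s / f for s, f in zip(wall_times[slow], wall_times[fast])]

# Jupyter yarn-client versus Jupyter local (the comparison in the question).
jupyter_ratio = slowdown("Jupyter (yarn-client)", "Jupyter (local)")
# Jupyter yarn-client versus the same master launched from the edge node.
client_ratio = slowdown("Jupyter (yarn-client)", "edge node (yarn-client)")

print(["%.2f" % r for r in jupyter_ratio])
print(["%.2f" % r for r in client_ratio])
```

The third query gives roughly 2.56 for the Jupyter comparison, which matches the factor quoted. The second list is arguably more telling: with the same yarn-client master, the notebook is far slower than the edge node on codes 2 and 3, which points at the notebook environment rather than at YARN itself.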

I would expect Jupyter notebook with "spark-submit --master yarn-client" to be faster than with "spark-submit --master local". What could explain this difference?

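Could it be the authentication step behind JupyterHub, the proxy translation, or the multi-user management?

How can I check what is wrong with our configuration? Is there any documentation on how to set this up optimally? Jupyter Notebook is great for fast data exploration, so I would like to understand where the difference comes from.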
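
One way to start checking the configuration is to dump the effective settings in each session and compare them. Below is a minimal sketch, assuming the settings are available as (key, value) pairs such as those returned by `SparkConf.getAll()`. The helper `diff_conf` is hypothetical and the sample values are illustrative, not captured from a real session.

```python
def diff_conf(conf_a, conf_b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ.

    A key missing on one side is reported with None for that side.
    """
    a, b = dict(conf_a), dict(conf_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

# Illustrative pairs only; in practice paste the output of getAll()
# from the notebook session and from the edge-node job.
notebook_conf = [
    ("spark.master", "yarn-client"),
    ("spark.executor.memory", "10g"),
    ("spark.dynamicAllocation.enabled", "true"),
]
edge_node_conf = [
    ("spark.master", "yarn-client"),
    ("spark.executor.memory", "10g"),
    ("spark.dynamicAllocation.enabled", "true"),
    ("spark.executor.cores", "4"),
]

for key, (left, right) in diff_conf(notebook_conf, edge_node_conf).items():
    print(key, "notebook:", left, "edge node:", right)
```

If the two yarn-client sessions turn out to have identical settings, the difference is more likely to come from the environment the kernel runs in (Python interpreter, PATH, proxy) than from Spark itself.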

My Spark settings are as follows:

spark-submit --master yarn-cluster \
  --files /etc/hive/conf/hive-site.xml \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/cloudera/extras/anaconda3/bin/python3 \
  --conf spark.ui.enabled=false \
  --conf spark.yarn.security.tokens.hive.enabled=false \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.io.compression.codec=snappy \
  --conf spark.speculation=true \
  --conf spark.shuffle.manager=sort \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=4 \
  --conf spark.driver.maxResultSize=10g \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.executor.cores=4 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.executor.memory=10g \
  --conf spark.driver.memory=10g \
  --conf spark.driver.extraJavaOptions=-Xms10g \
  --conf spark.akka.frameSize=2047 \
  --conf spark.kryoserializer.buffer.max=2047mb \
  testpyspark.py
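
To get a feel for what these settings ask of the cluster, here is a rough calculation of the YARN footprint. It assumes each executor container is requested as spark.executor.memory plus spark.yarn.executor.memoryOverhead (YARN may round the request up to its own allocation increment, which is ignored here). The helper `parse_mb` is hypothetical and only handles the g and m suffixes used above.

```python
def parse_mb(size):
    """Convert a size such as '10g' or '2047mb' to MB (g and m suffixes only)."""
    s = size.strip().lower()
    if s.endswith("b"):
        s = s[:-1]
    factor = {"g": 1024, "m": 1}[s[-1]]
    return int(s[:-1]) * factor

# Values copied from the spark-submit command above.
settings = {
    "spark.executor.memory": "10g",
    "spark.yarn.executor.memoryOverhead": "6144",  # plain number, read as MB
    "spark.executor.cores": "4",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.initialExecutors": "4",
    "spark.dynamicAllocation.maxExecutors": "20",
}

container_mb = (parse_mb(settings["spark.executor.memory"])
                + int(settings["spark.yarn.executor.memoryOverhead"]))
cores = int(settings["spark.executor.cores"])

footprint = {}
for stage in ("min", "initial", "max"):
    n = int(settings["spark.dynamicAllocation.%sExecutors" % stage])
    footprint[stage] = {
        "executors": n,
        "memory_gib": n * container_mb / 1024,
        "cores": n * cores,
    }

print("container size (MB):", container_mb)
for stage in ("min", "initial", "max"):
    print(stage, footprint[stage])
```

With dynamic allocation the job starts at the initial size and only grows if the cluster can grant more containers, so a busy queue can make a yarn-client session noticeably slower than the numbers suggest. In local mode none of the executor settings apply, because everything runs inside the single driver JVM.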

Answer 1

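This issue is related to Cloudera 5.8. There is a compatibility problem between the Python version executed in Jupyter and the HDFS name node.

The solution is to add the path to Python 2 to the PATH defined in the kernel, in addition to all the other paths the kernel itself needs:

"PATH":"/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/cloudera/extras/anaconda2/bin/"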

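For illustration, here is a small sketch of applying that change to a kernel spec, assuming the PATH lives in the "env" section of the kernel's kernel.json. The helper `append_to_kernel_path` and the sample spec (display name, argv) are hypothetical; only the PATH entries come from the answer above.

```python
import json

def append_to_kernel_path(kernel_spec, extra_dir):
    """Return a copy of a kernel spec with extra_dir appended to env['PATH']."""
    spec = json.loads(json.dumps(kernel_spec))  # deep copy via JSON round-trip
    env = spec.setdefault("env", {})
    parts = [p for p in env.get("PATH", "").split(":") if p]
    if extra_dir not in parts:
        parts.append(extra_dir)
    env["PATH"] = ":".join(parts)
    return spec

# Hypothetical kernel spec; in practice load the real kernel.json with json.load().
kernel_spec = {
    "display_name": "PySpark (yarn-client)",
    "language": "python",
    "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
    "env": {
        "PATH": "/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin",
    },
}

updated = append_to_kernel_path(kernel_spec, "/opt/cloudera/extras/anaconda2/bin/")
print(json.dumps(updated["env"], indent=2))
```

The helper leaves the original dict untouched and does nothing if the directory is already present, so it is safe to run more than once. The kernel has to be restarted after kernel.json is edited for the new PATH to take effect.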
