Run a Python Spark job in yarn-cluster mode

Question

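I have a problem running the pi.py script from the Spark Python examples. In yarn-client mode everything works fine. In yarn-cluster mode, however, the job fails to start and the container returns a syntax error like this: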

LogType: stdout
LogUploadTime: Thu May 21 08:48:16 +0800 2015
LogLength: 111
Log Contents:
  File "pi.py", line 40
    return 1 if x ** 2 + y ** 2 < 1 else 0

I am sure the script itself is correct. Can anyone help?

Answer 1

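I noticed that the line reported as a syntax error uses a feature found only in newer versions of Python (the conditional expression `a if c else b`, added in Python 2.5), so I think the problem may be the Python version that Spark is using on the cluster.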
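One way to check this hypothesis is to run a tiny diagnostic under the interpreter in question. The sketch below is only an illustration: the helper name `parses` is made up here, and the script avoids newer syntax on purpose so that an old interpreter can still load it. It reports which Python binary is running and whether that interpreter can parse the expression from line 40 of pi.py.

```python
import sys

# The expression from line 40 of pi.py, kept as a string so that an old
# interpreter can still load this file and report the failure itself.
LINE_40 = "1 if x ** 2 + y ** 2 < 1 else 0"


def parses(source):
    """Return True if the running interpreter can parse `source` as an expression."""
    try:
        compile(source, "<check>", "eval")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    # Single-argument print() works on both Python 2 and Python 3.
    print("executable: %s" % sys.executable)
    print("version: %s" % sys.version.split()[0])
    print("conditional expression parses: %s" % parses(LINE_40))
```

If the last line reports `False` on a cluster node, the interpreter picked up there is older than the one on the machine where yarn-client mode works.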

I added a property to /etc/spark/conf.cloudera.spark_on_yarn/spark-defaults.conf:

spark.yarn.appMasterEnv.PYSPARK_PYTHON

and set it to the path of the Python binary.
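The same property can presumably also be passed per job through the `--conf` option of spark-submit instead of the defaults file. The sketch below only builds and prints such a command line; it does not run it. The interpreter path `/usr/bin/python2.7` and the helper name `build_submit_command` are assumptions for illustration, not values from the answer.

```python
# Hypothetical interpreter path; replace it with a Python that exists on every node.
PYTHON_BIN = "/usr/bin/python2.7"


def build_submit_command(script, python_bin=PYTHON_BIN):
    """Build a spark-submit argv that sets PYSPARK_PYTHON for the YARN application master."""
    return [
        "spark-submit",
        "--master", "yarn-cluster",
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_bin,
        script,
    ]


if __name__ == "__main__":
    # Print the command for inspection rather than executing it.
    print(" ".join(build_submit_command("pi.py")))
```

In yarn-cluster mode the driver runs inside the application master, so the environment has to be set at submission time; setting it from inside the script would be too late.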

Answer 2

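Spark currently does not support running Python scripts in cluster mode (that is, deploying the driver to the cluster). From the documentation: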

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters or Python applications.

https://spark.apache.org/docs/1.3.1/submitting-applications.html
