Submit .py script on Spark without Hadoop installation

I have the following simple wordcount Python script:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)

from operator import add
f = sc.textFile("C:/Spark/spark-1.2.0/README.md")
wc = f.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1)).reduceByKey(add)
print wc  # Python 2 print statement; this shows the RDD object, not the counts
wc.saveAsTextFile("wc_out.txt")

I am launching the script with this command line:

spark-submit "C:/Users/Alexis/Desktop/SparkTest.py"

I get the following error:

Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
15/04/20 18:58:01 WARN Utils: Your hostname, AE-LenovoUltra resolves to a loopback address: 127.0.1.2; using 192.168.1.63 instead (on interface net0)
15/04/20 18:58:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/20 18:58:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/20 18:58:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:867)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
        at org.apache.spark.SparkContext$$anonfun.apply(SparkContext.scala:280)
        at org.apache.spark.SparkContext$$anonfun.apply(SparkContext.scala:280)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
  File "C:/Users/Alexis/Desktop/SparkTest.py", line 3, in <module>
    sc = SparkContext(conf = conf)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 105, in __init__
    conf, jsc)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 153, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 701, in __call__
  File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
        at org.apache.spark.SparkContext$$anonfun.apply(SparkContext.scala:280)
        at org.apache.spark.SparkContext$$anonfun.apply(SparkContext.scala:280)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

To a Spark beginner like me, the culprit seems to be this line: "ERROR Shell: Failed to locate the winutils binary in the hadoop binary path". However, the Spark documentation explicitly states that a Hadoop installation is not required to run Spark in standalone mode.

What am I doing wrong?

The good news is that you are not doing anything wrong, and your code will run once the error is mitigated.

Despite the claim that Spark runs on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356) and a patch is available, but as of Spark 1.3.1 the patch has not been merged into the main branch.

Fortunately, there is a fairly simple workaround.

  1. Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed at D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin

  2. Download winutils.exe from Hortonworks and put it into the bin directory created in the first step. The download link is for Win64: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

  3. Create a "HADOOP_HOME" environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:

    • a. Set up a permanent environment variable via Control Panel -> System -> Advanced System Settings -> Advanced tab -> Environment Variables. You can create either a user variable or a system variable with the following parameters:

      Variable Name=HADOOP_HOME
      Variable Value=D:\Languages\Spark\winutils\

    • b. Set a temporary environment variable in your command shell before executing your script (an in-script variant is sketched after this list):

      set HADOOP_HOME=d:\Languages\Spark\winutils

  4. Run your code. It should now work without error.
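
If you would rather not change any Windows settings, option (b) can also be applied from inside the script itself, since environment variables set in the Python process are inherited by the JVM that PySpark launches. The following is a minimal sketch of that idea, not part of the original fix; the winutils path is the hypothetical one from step 1, so adjust it to your own layout.

import os

# Must be set before SparkContext is created; point it at the (hypothetical)
# winutils directory from step 1 - the parent of bin, not bin itself.
os.environ["HADOOP_HOME"] = "D:\\Languages\\Spark\\winutils"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)  # Hadoop's Shell class should now find winutils.exe

You submit the script exactly as before (spark-submit SparkTest.py); the advantage is that you do not have to remember to run the set command in every new shell.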