python Spark job in Oozie using spark action
I have been trying to run a Python script on Spark (1.3.1.2.3), using Oozie to schedule the Spark job. I have a 3-node cluster running HDP 2.3, installed with Ambari 2.1.1.
When the job executes I run into the following error:
>>> Invoking Main class now >>>
Fetching child yarn jobs
tag id : oozie-9d5f396daac34b4a41fed946fac0472
Child yarn jobs are found -
Spark Action Main class : org.apache.spark.deploy.SparkSubmit
Oozie Spark action configuration
=================================================================
--master
yarn-client
--deploy-mode
client
--name
boxplot outlier
--class
/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py
--executor-memory
1G
--driver-memory
1G
--executor-cores
4
--num-executors
2
--conf
spark.yarn.queue=default
--verbose
/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py
=================================================================
>>> Invoking Spark class now >>>
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
main()
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
sc = SparkContext(conf=conf)
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:129)
at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:98)
at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)
at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)
at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:60)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Here is my workflow.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns='uri:oozie:workflow:0.4' name='sparkjob'>
    <start to='spark-process' />
    <action name='spark-process'>
        <spark xmlns='uri:oozie:spark-action:0.1'>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launcher2</value>
                </property>
                <property>
                    <name>oozie.service.SparkConfigurationService.spark.configurations</name>
                    <value>spark.eventLog.dir=hdfs://node1.analytics.tardis:8020/user/spark/applicationHistory,spark.yarn.historyServer.address=http://node1.analytics.tardis:18088,spark.eventLog.enabled=true</value>
                </property>
            </configuration>
            <master>yarn-client</master>
            <mode>client</mode>
            <name>boxplot outlier</name>
            <class>/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py</class>
            <jar>/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py</jar>
            <spark-opts>--executor-memory 1G --driver-memory 1G --executor-cores 4 --num-executors 2 --conf spark.yarn.queue=default</spark-opts>
        </spark>
        <ok to='end'/>
        <error to='spark-fail'/>
    </action>
    <kill name='spark-fail'>
        <message>Spark job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
From an initial search, the error seems to stem from a dependency conflict introduced when packaging the jar files that contain the Spark job code. The Python script boxplot_outlier.py does not import any dependency that should cause such a conflict; its relevant part is sketched below.
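For illustration, the SparkContext setup in the script looks roughly like this (a simplified sketch only; the app name and the body of main() are placeholders, but the only non-standard import is pyspark itself):

# Simplified sketch of the relevant part of boxplot_outlier.py
# (placeholders only; the real outlier logic is omitted).
from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("boxplot outlier")
    # This is the call that fails (line 60 in the traceback above).
    sc = SparkContext(conf=conf)
    try:
        pass  # ... outlier computation would go here ...
    finally:
        sc.stop()

if __name__ == "__main__":
    main()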
I need some guidance here! Any suggestions would be greatly appreciated.
EDIT: I checked the classpath elements in the launcher job configuration of the Oozie Java/Map-Reduce/Pig action, and it includes the following two jars:
/hadoop/yarn/local/usercache/ambari-qa/appcache/application_1441804290161_0903/container_e03_1441804290161_0903_01_000002/mr-framework/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar
/hadoop/yarn/local/usercache/ambari-qa/appcache/application_1441804290161_0903/container_e03_1441804290161_0903_01_000002/javax.servlet-3.0.0.v201112011016.jar
From the discussion in SPARK-1693, it looks like these two jars could cause exactly this kind of dependency conflict, although that issue was resolved back in release 1.1.0. There may be a problem with the Hadoop 2.7 dependencies, or I may be missing some configuration. Any help would be much appreciated.
Finally resolved it. It turns out that removing javax.servlet-3.0.0.v201112011016.jar from the Oozie sharelib spark directory in HDFS mitigates the issue. I am not sure whether this is the right way to fix it, or whether it is a configuration problem in the HDP 2.3.0 distribution. I will report it to the HDP folks for further investigation.
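For anyone looking for the concrete steps, the removal was roughly as follows (the lib_&lt;timestamp&gt; directory and the Oozie URL are placeholders and depend on your install):

# list the spark sharelib to find the exact directory (URL is illustrative)
oozie admin -oozie http://<oozie-host>:11000/oozie -shareliblist spark
# remove the conflicting Jetty servlet jar from the sharelib in HDFS
hdfs dfs -rm /user/oozie/share/lib/lib_<timestamp>/spark/javax.servlet-3.0.0.v201112011016.jar
# tell Oozie to pick up the modified sharelib
oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate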
Seeing the same issue on Cloudera CDH 5.5.2 as well. I could not find any reference to this being a known issue. Removing the jar from the sharelib seems like a big hack.
To verify the theory, I removed javax.servlet-3.0.0.v201112011016.jar from the sharelib and ran a sharelib update (otherwise Oozie complains that the file is missing), then added javax.servlet-api-3.1.0.jar to my own custom oozie.libpath (it could also go into the sharelib, but I did not want to do that), and the problem went away. There must surely be another way, though.
Sharing the rough steps here anyway, in case they help:
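After removing the jar and running the same sharelib update as shown above, the custom libpath part looks roughly like this (HDFS paths, user directory, and namenode host are placeholders):

# upload a single servlet 3.x API jar to a custom lib directory
hdfs dfs -mkdir -p /user/<me>/oozie-libs
hdfs dfs -put javax.servlet-api-3.1.0.jar /user/<me>/oozie-libs/

# job.properties: use that directory in addition to the system sharelib
oozie.use.system.libpath=true
oozie.libpath=hdfs://<namenode>:8020/user/<me>/oozie-libs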