Add Spark to Oozie shared lib

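By default, the Oozie shared lib directory provides libraries for Hive, Pig, and Map-Reduce. If I want to run Spark jobs on Oozie, it would be better to add the Spark lib jars to Oozie's shared lib than to copy them into the application's lib directory.
How can I add the Spark lib jars (including spark-core and its dependencies) to Oozie's shared lib? Any comment/answer is appreciated.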

The Spark action is scheduled to be released with Oozie 4.2.0, although the docs seem to lag a bit behind. See the related JIRA here: Oozie JIRA - Add spark action executor

Cloudera's CDH 5.4 release already includes it; see the official documentation here: CDH 5.4 oozie doc - Oozie Spark Action Extension

With older versions of Oozie, the jars can be shared in several ways. The first approach will probably work best. In any case, here is the full list:

Below are the various ways to include a jar with your workflow:

Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties.

This is useful if you have many workflows that all need the same jar; you can put it in one place in HDFS and use it with many workflows. The jars will be available to all actions in that workflow. There is no need to ever point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties.
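As a minimal sketch of this first option, the Python snippet below renders a job.properties file that sets both properties mentioned above. The helper name `render_job_properties`, the NameNode address, and all HDFS paths are hypothetical placeholders chosen for illustration; they do not come from the original post.

```python
def render_job_properties(lib_paths, app_path, name_node="hdfs://namenode:8020"):
    """Return job.properties text that shares jars through oozie.libpath.

    Every path and the name_node default are made-up placeholders.
    """
    lines = [
        f"nameNode={name_node}",
        # Comma-separated HDFS directories whose jars all actions can use.
        "oozie.libpath=" + ",".join(lib_paths),
        # Let Oozie add its own ShareLib; it does not belong in oozie.libpath.
        "oozie.use.system.libpath=true",
        f"oozie.wf.application.path={app_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_job_properties(
            ["/path/to/jars", "/another/path/to/jars"],
            "${nameNode}/user/someuser/apps/spark-wf",
        ),
        end="",
    )
```

Keeping the shared jars in their own HDFS directories means a new workflow only needs this one `oozie.libpath` line to pick them up.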

Create a directory named "lib" next to your workflow.xml in HDFS and put the jars in it.

This is useful if you have some jars that you only need for one workflow. Oozie will automatically make those jars available to all actions in that workflow.
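The layout described here can be sketched as follows. The snippet only builds the `hdfs dfs` commands that would create the lib directory and upload the jars; it does not run them. The function names, the application path, and the jar file names are hypothetical.

```python
import posixpath


def lib_dir_for(workflow_xml_path):
    """Return the lib/ directory that sits next to the given workflow.xml."""
    return posixpath.join(posixpath.dirname(workflow_xml_path), "lib")


def upload_commands(workflow_xml_path, local_jars):
    """Build (but do not execute) the hdfs commands that populate lib/."""
    lib_dir = lib_dir_for(workflow_xml_path)
    commands = [["hdfs", "dfs", "-mkdir", "-p", lib_dir]]
    for jar in local_jars:
        # Each local jar is copied into the workflow's lib/ directory.
        commands.append(["hdfs", "dfs", "-put", jar, lib_dir + "/"])
    return commands


if __name__ == "__main__":
    # Hypothetical application path and jar names.
    for cmd in upload_commands(
        "/user/someuser/apps/spark-wf/workflow.xml",
        ["spark-core.jar", "my-job.jar"],
    ):
        print(" ".join(cmd))
```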

Specify the <file> tag in an action with the path to a single jar; you can have multiple <file> tags.

This is useful if you want some jars only for a specific action and not all actions in a workflow. The downside is that you have to specify them in your workflow.xml, so if you ever need to add/remove some jars, you have to change your workflow.xml.
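To illustrate the per-action approach, here is a rough sketch that generates an action element carrying one `<file>` child per jar, assuming the tag in question is Oozie's `<file>` element. Using a java action is an assumption made for illustration only, and the action name, main class, and jar paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def java_action_with_files(name, main_class, jar_paths):
    """Build an <action> whose <java> child lists one <file> per jar."""
    action = ET.Element("action", {"name": name})
    java = ET.SubElement(action, "java")
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    ET.SubElement(java, "main-class").text = main_class
    for path in jar_paths:
        # One <file> element per jar; only this action sees these jars.
        ET.SubElement(java, "file").text = path
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    return action


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    action = java_action_with_files(
        "spark-driver",
        "com.example.SparkDriver",
        ["/path/to/jars/spark-core.jar", "/path/to/jars/my-job.jar"],
    )
    print(ET.tostring(action, encoding="unicode"))
```

Because the jar list lives in workflow.xml itself, adding or removing a jar means regenerating or editing that file, which is the downside noted above.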

Add the jars to the ShareLib (e.g. /user/oozie/share/lib/lib_/pig)

While this will work, it’s not recommended for two reasons: The additional jars will be included with every workflow using that ShareLib, which may be unexpected to those workflows and users. When upgrading the ShareLib, you’ll have to recopy the additional jars to the new ShareLib.

Quoted from Robert Kanter's blog post: How-to: Use the ShareLib in Apache Oozie (CDH 5)