External dependency for Spark job

Question

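I am new to big data technologies. I have to run a Spark job in cluster mode on EMR. The job is written in Python and depends on several libraries and some other tools. I have already written the script and run it locally in client mode, but it raises dependency errors when I try to run it with YARN. How do I manage these dependencies?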

Logs:

"/mnt/yarn/usercache/hadoop/appcache/application_1511680510570_0144/container_1511680510570_0144_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
    __import__(name)
ImportError: ('No module named boto3', <function subimport at 0x7f8c3c4f9c80>, ('boto3',))

        at org.apache.spark.api.python.PythonRunner$$anon.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Answer 1

It seems you have not installed the Boto 3 library. Download a compatible version and install it using one of the commands below:

$ pip install boto3

or

$ python -m pip install --user boto3

Hope this helps. You can refer to this link: https://github.com/boto/boto3

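So it seems you have not installed boto3 on all the executors (nodes). Because you are running Spark, parts of the Python code run on the driver and parts run on the executors. If the job runs on YARN, you need to install the library on every node.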
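The point about executors can be checked from inside the job itself. The following is a minimal sketch, not part of the original answer: missing_modules and check_partition are hypothetical helper names, and the check simply mirrors the __import__(name) call that fails in the log above.

```python
import socket


def missing_modules(names):
    """Return the module names from `names` that fail to import in this interpreter."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


def check_partition(_rows):
    """For use with rdd.mapPartitions: yield (hostname, missing modules) for this worker."""
    # A tuple is used so the result is hashable and can be de-duplicated on the driver.
    yield socket.gethostname(), tuple(missing_modules(["boto3"]))


# Hypothetical usage in a PySpark job, assuming `sc` is an existing SparkContext:
#   report = set(sc.parallelize(range(100), 10).mapPartitions(check_partition).collect())
#   for host, missing in sorted(report):
#       print(host, missing)
```

Any host that reports a non-empty tuple is one where the library still needs to be installed. Tasks are not guaranteed to land on every node, so an empty report is only a hint and not proof that all nodes are set up.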

To install the same, please refer to -

Answer 2

Yes, you can:

aws emr create-cluster --bootstrap-actions Path=<>,Name=BootstrapAction1,Args=[arg1,arg2].. --auto-terminate

Please refer to the following: http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses
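If the cluster is created from Python instead of the CLI, the same bootstrap action can be described as a dictionary. This is only a sketch: bootstrap_action is a hypothetical helper, the S3 path is a placeholder, and the script at that path is assumed to run the pip install command from Answer 1 on each node. The dictionary follows the shape of the BootstrapActions parameter of boto3's EMR run_job_flow call.

```python
def bootstrap_action(name, script_path, args=()):
    """Build one entry for the BootstrapActions list of boto3's EMR run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,   # location of the bootstrap script, e.g. in S3
            "Args": list(args),    # arguments passed to the script on each node
        },
    }


# Placeholder path: replace with the location of your own bootstrap script.
action = bootstrap_action(
    "BootstrapAction1",
    "s3://my-bucket/install-deps.sh",
    ["arg1", "arg2"],
)
print(action)
```

The result would be passed as BootstrapActions=[action] to boto3.client("emr").run_job_flow(...), together with the other cluster settings that call requires, which are not shown here.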

emr

hadoop-yarn

pyspark