How to upload files to Amazon EMR?

Question

My code is as follows:

# test2.py

from pyspark import SparkContext, SparkConf, SparkFiles
conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
with open(SparkFiles.get("test_warc.txt")) as f:
  print("opened")
sc.stop()

It works when I run it locally:

spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py

But after adding it as a step to an EMR cluster:

spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py

I get this error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'

The file path is correct, but for some reason the file is not being uploaded from S3.

I tried loading the file from the executors:

from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
def f(_):
    # Runs on an executor: count one successful open per partition element.
    a = 0
    with open(SparkFiles.get("test_warc.txt")) as fh:
      a += 1
      print("opened")
    return a
count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count) # prints 2

sc.stop()

And it runs without errors.

It looks like the --files argument uploads files to the executors only. How can I upload to the master?

Answer 1

Your understanding is correct.

--files argument is uploading files to executors only.

See this in the Spark documentation:

file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.

You can read more about this at advanced-dependency-management.

Now, coming to your second question:

How can I upload to master?

EMR has a concept called a bootstrap action. The official documentation describes it as follows:

You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.

How do I use it in my case?

When spawning the cluster, you can specify your script in the BootstrapActions JSON, as shown below, alongside your other custom configuration:

BootstrapActions=[
            {'Name': 'Setup Environment for downloading my script',
             'ScriptBootstrapAction':
                 {
                     'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
                 }
             }]
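For context, here is a minimal sketch of where that BootstrapActions list could go, assuming the cluster is created with boto3's run_job_flow (the answer does not say which client is used). The cluster name, release label, instance types, and role names are placeholder assumptions, not values from the question.

```python
def build_bootstrap_actions(script_s3_path):
    """Return a BootstrapActions list in the same shape as the snippet above."""
    return [
        {
            "Name": "Setup Environment for downloading my script",
            "ScriptBootstrapAction": {"Path": script_s3_path},
        }
    ]


def launch_cluster(script_s3_path):
    """Start an EMR cluster that runs the bootstrap script on its nodes.

    Every value below except BootstrapActions is a placeholder assumption.
    """
    # Imported here so build_bootstrap_actions works without boto3 installed.
    import boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(
        Name="test-cluster",                # hypothetical cluster name
        ReleaseLabel="emr-6.2.0",           # assumed release label
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        BootstrapActions=build_bootstrap_actions(script_s3_path),
        JobFlowRole="EMR_EC2_DefaultRole",  # assumed default role names
        ServiceRole="EMR_DefaultRole",
    )
```

Calling launch_cluster("s3://your-bucket-name/path-to-custom-scripts/download-file.sh") would need valid AWS credentials and permissions, so treat it as an outline rather than a tested deployment script.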

The content of download-file.sh should look like this:

#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir

Now, in your Python script, you can read the file from workingDir/test_warc.txt (with the script above, that is /opt/your-path/test_warc.txt).
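As a rough sketch of what that could look like in test2.py, the open(SparkFiles.get(...)) call is replaced by a plain local path. The helper names local_copy_path and can_open are made up for illustration, and WORKING_DIR is assumed to match workingDir in download-file.sh.

```python
import os

# Assumption: must match workingDir in download-file.sh.
WORKING_DIR = "/opt/your-path/"


def local_copy_path(file_name, working_dir=WORKING_DIR):
    """Build the local path where the bootstrap action placed the file."""
    return os.path.join(working_dir, file_name)


def can_open(path):
    """Return True if the file at `path` can be opened for reading."""
    try:
        with open(path) as fh:
            fh.read(1)
        return True
    except OSError:
        return False
```

In the driver, `with open(local_copy_path("test_warc.txt")) as f:` would then take the place of the SparkFiles.get call. Note that with --deploy-mode cluster the driver may run on a node other than the master, so the bootstrap action likely needs to run on every node where the driver can be scheduled.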

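There is also an option to run your bootstrap action on the master node only, on task nodes only, or on a mix of both. bootstrap-actions/run-if is the script we can use for this case. You can read more about it at emr-bootstrap-runif.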
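A possible shape for such an entry is sketched below, in the same BootstrapActions format as before. The S3 path of run-if and the instance.isMaster=true condition are written as I recall them from the AWS documentation, so treat both as assumptions and check them against the emr-bootstrap-runif page; build_run_if_action is a hypothetical helper name.

```python
# Assumption: location of the AWS-provided run-if script; verify for your region.
RUN_IF_PATH = "s3://elasticmapreduce/bootstrap-actions/run-if"


def build_run_if_action(condition, command):
    """Build a bootstrap action that runs `command` only where `condition` holds."""
    return {
        "Name": "Conditional bootstrap action",
        "ScriptBootstrapAction": {
            "Path": RUN_IF_PATH,
            # run-if takes the condition first, then the command to execute.
            "Args": [condition, command],
        },
    }


# Example: run a command on the master node only (condition syntax assumed).
master_only = build_run_if_action(
    "instance.isMaster=true",
    "echo running on master node",
)
```

The resulting dictionary can be placed in the BootstrapActions list next to, or instead of, the unconditional download-file.sh entry.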

amazon-emr

apache-spark

pyspark