Where to set the S3 configuration in Spark locally?

I have set up a Docker container that starts a Jupyter notebook with Spark. I have added the necessary jars to Spark's jars directory so that it can access the S3 file system. My Dockerfile:

FROM jupyter/pyspark-notebook

EXPOSE 8080 7077 6066

RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root

RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )

# The AWS SDK relies on Guava, but the default Guava jar shipped in jars/ is too old to be compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )

USER jovyan

ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter

This works fine so far. However, every time I create a kernel session in Jupyter, I have to set the EnvironmentVariableCredentialsProvider manually, because by default it expects the IAMInstanceCredentialsProvider to supply credentials, which obviously do not exist here. So I have to run this in Jupyter every time:

spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
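
For comparison, the same property can also be passed once when the SparkSession is built, using the spark.hadoop. prefix, so it does not have to be repeated in every notebook cell. A minimal sketch, assuming a plain local session (the app name and the commented-out bucket path are made up):

from pyspark.sql import SparkSession

# Sketch: set the Hadoop property up front via the spark.hadoop. prefix
spark = (
    SparkSession.builder
    .appName("s3a-example")  # hypothetical app name
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",
    )
    .getOrCreate()
)

# df = spark.read.parquet("s3a://my-bucket/path/to/data")  # hypothetical bucket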

Can I configure this somewhere in a file so that the credential provider is set correctly by default?

I tried creating ~/.aws/credentials to see whether Spark would read the credentials from there by default, but it doesn't.
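
(If picking up ~/.aws/credentials were actually desired, the s3a provider could presumably be pointed at the AWS SDK's profile-based provider instead; a hedged sketch, not something I have verified in this image:)

# Sketch: let the AWS SDK read ~/.aws/credentials instead of environment variables
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider",
)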

The s3a connector actually looks for the s3a options first, then the environment variables, and then the IAM properties: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3

There may be a problem with your spark-defaults config file.
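
To make that lookup order explicit instead of relying on the default chain, the provider list can also be pinned directly. A hedged sketch; the class names come from the hadoop-aws and aws-java-sdk-bundle jars installed in the image above:

# Sketch: spell out the credential lookup chain explicitly
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    ",".join([
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",      # fs.s3a.access.key / fs.s3a.secret.key
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",  # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",      # EC2 instance metadata / IAM role
    ]),
)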

After a few days of searching the web, I found the property that was missing from my spark-defaults.conf (although it is not in the official documentation):

spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
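
To keep that setting inside the image, one option is to append the line to spark-defaults.conf from the Dockerfile. A sketch, assuming SPARK_HOME is /usr/local/spark as in the Dockerfile above (the file may not exist yet, so >> creates it):

USER root
RUN echo "spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider" \
    >> /usr/local/spark/conf/spark-defaults.conf
USER jovyan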