Credential problems when using both S3 and Redshift

I am running a Spark SQL program on EMR that pulls data from both S3 and Redshift, joins it, and writes the result back to Redshift. I have a credential problem: once I query Redshift, I can no longer access S3, and my program fails with:

pyspark.sql.utils.IllegalArgumentException: u'AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).'
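For reference, the fs.s3 properties named in the error are ordinary Hadoop configuration settings; a minimal sketch of setting them explicitly (assuming an existing SparkSession named spark, with placeholder key values) would be:

    # Sketch only: set the S3 credential properties mentioned in the error
    # on the Hadoop configuration backing the SparkSession.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.awsAccessKeyId", "<ACCESS_KEY_ID>")          # placeholder value
    hadoop_conf.set("fs.s3.awsSecretAccessKey", "<SECRET_ACCESS_KEY>")  # placeholder value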

The code that connects to Redshift is:

# Write the joined DataFrame to Redshift via the spark-redshift connector,
# staging the rows in S3 under s3_temp_out. Note that save(mode='append')
# overrides the earlier .mode("error") setting.
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", rs_jdbc + ":" + rs_port + "/" + rs_db + "?user=" + rs_username + "&password=" + rs_password) \
    .option("dbtable", table) \
    .option("tempdir", s3_temp_out) \
    .mode("error") \
    .save(mode='append')

Any help would be greatly appreciated.

I would not recommend using an access key and secret key. It is better to use the ARN of the corresponding IAM role, as described here.

Have Redshift assume an IAM role (most secure): You can grant Redshift permission to assume an IAM role during COPY or UNLOAD operations and then configure this library to instruct Redshift to use that role:

1. Create an IAM role granting appropriate S3 permissions to your bucket.
2. Follow the guide Authorizing Amazon Redshift to Access Other AWS Services On Your Behalf to configure this role's trust policy in order to allow Redshift to assume this role.
3. Follow the steps in the Authorizing COPY and UNLOAD Operations Using IAM Roles guide to associate that IAM role with your Redshift cluster.
4. Set this library's aws_iam_role option to the role's ARN.
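
For illustration, a minimal sketch of the question's write rewritten to use the role (assuming a hypothetical variable rs_iam_role that holds the role's ARN; the other variables are from the question) could look like:

    # Sketch: the same write, but instructing Redshift to assume an IAM role
    # for the COPY instead of static keys, via the connector's aws_iam_role option.
    (df.write
        .format("com.databricks.spark.redshift")
        .option("url", rs_jdbc + ":" + rs_port + "/" + rs_db
                + "?user=" + rs_username + "&password=" + rs_password)
        .option("dbtable", table)
        .option("tempdir", s3_temp_out)
        # ARN of the role associated with the Redshift cluster (hypothetical variable)
        .option("aws_iam_role", rs_iam_role)
        .mode("append")
        .save())

With aws_iam_role set, the connector instructs Redshift to use that role for the COPY/UNLOAD, as described in the steps above.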