HDFS to S3 DistCp - Access Keys

To copy files from HDFS to an S3 bucket, I used the command

hadoop distcp -Dfs.s3a.access.key=ACCESS_KEY_HERE \
-Dfs.s3a.secret.key=SECRET_KEY_HERE /path/in/hdfs s3a://BUCKET_NAME

But the access key and secret key are visible here, which is insecure. Is there any way to supply the credentials from a file? I don't want to edit the config files, which is one of the approaches I have come across.

Amazon allows you to generate temporary credentials that you can retrieve from http://169.254.169.254/latest/meta-data/iam/security-credentials/

You can read there:

An application on the instance retrieves the security credentials provided by the role from the instance metadata item iam/security-credentials/role-name. The application is granted the permissions for the actions and resources that you've defined for the role through the security credentials associated with the role. These security credentials are temporary and we rotate them automatically. We make new credentials available at least five minutes prior to the expiration of the old credentials.

The following command retrieves the security credentials for an IAM role named s3access.

$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access

The following is sample output.

{
  "Code" : "Success",
  "LastUpdated" : "2012-04-26T16:39:16Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "AKIAIOSFODNN7EXAMPLE",
  "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "Token" : "token",
  "Expiration" : "2012-04-27T22:39:16Z"
}

For applications, AWS CLI, and Tools for Windows PowerShell commands that run on the instance, you do not have to explicitly get the temporary security credentials — the AWS SDKs, AWS CLI, and Tools for Windows PowerShell automatically get the credentials from the EC2 instance metadata service and use them. To make a call outside of the instance using temporary security credentials (for example, to test IAM policies), you must provide the access key, secret key, and the session token. For more information, see Using Temporary Security Credentials to Request Access to AWS Resources in the IAM User Guide.

I ran into the same situation, after obtaining temporary credentials from the instance metadata. (Note that the temporary credentials mentioned here belong to an IAM role, which is attached to the EC2 server and not to a person; if you are using an IAM user's credentials instead, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html)

I found that specifying the credentials in the hadoop distcp command alone does not work. You also have to specify the configuration fs.s3a.aws.credentials.provider. (See http://hortonworks.github.io/hdp-aws/s3-security/index.html#using-temporary-session-credentials)

The final command looks like this:

hadoop distcp \
-Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
-Dfs.s3a.access.key="{AccessKeyId}" \
-Dfs.s3a.secret.key="{SecretAccessKey}" \
-Dfs.s3a.session.token="{SessionToken}" \
s3a://bucket/prefix/file /path/on/hdfs
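To tie this to the metadata endpoint shown earlier, here is a minimal sketch (assuming the role is named s3access as in the example above, and that jq is available for JSON parsing):

# fetch the role's temporary credentials from the instance metadata service
CREDS=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access)

hadoop distcp \
-Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
-Dfs.s3a.access.key="$(echo "$CREDS" | jq -r .AccessKeyId)" \
-Dfs.s3a.secret.key="$(echo "$CREDS" | jq -r .SecretAccessKey)" \
-Dfs.s3a.session.token="$(echo "$CREDS" | jq -r .Token)" \
s3a://bucket/prefix/file /path/on/hdfs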

Recent (2.8+) versions let you hide your credentials in a jceks file; there is some documentation for this on the Hadoop S3 page. That way there is no need to put any secrets on the command line at all; you just share them across the cluster and then, in the distcp command, set hadoop.security.credential.provider.path to the path, such as jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
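For example, a minimal sketch using that provider path (the HDFS source and the bucket/prefix are placeholders):

hadoop distcp \
-Dhadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
/path/in/hdfs \
s3a://bucket/prefix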

Fun fact: if you are running in EC2, the IAM role credentials should be picked up automatically from the default chain of credential providers: after looking at the configuration options and environment variables, it tries a GET of the EC2 http endpoint that serves up session credentials. If that is not happening, make sure that com.amazonaws.auth.InstanceProfileCredentialsProvider is on the list of credential providers. It is a bit slower than the others (and can get throttled), so it is best placed near the end.
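As a sketch, the provider list can also be set explicitly on the command line; the class names below exist in Hadoop's S3A module and the AWS SDK, but the exact list and order here is an illustrative choice, with InstanceProfileCredentialsProvider last per the advice above:

hadoop distcp \
-Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,com.amazonaws.auth.EnvironmentVariableCredentialsProvider,com.amazonaws.auth.InstanceProfileCredentialsProvider" \
/path/in/hdfs \
s3a://bucket/prefix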

If you do not want to use access and secret keys (or show them in your scripts), and if your EC2 instance has access to S3, then you can use the instance credentials

hadoop distcp \
-Dfs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider" \
/hdfs_folder/myfolder \
s3a://bucket/myfolder

Not sure if it is because of a version difference, but to use "secrets from credential providers" the -Dfs flag would not work for me; I had to use the -D flag, as shown in the hadoop version 3.1.3 "Using_secrets_from_credential_providers" docs.

First, I saved my AWS S3 credentials in a Java Cryptography Extension KeyStore (JCEKS) file.

hadoop credential create fs.s3a.access.key \
-provider jceks://hdfs/user/$USER/s3.jceks \
-value <my_AWS_ACCESS_KEY>

hadoop credential create fs.s3a.secret.key \
-provider jceks://hdfs/user/$USER/s3.jceks \
-value <my_AWS_SECRET_KEY>
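As a quick sanity check, you can list what was stored under that provider path:

hadoop credential list \
-provider jceks://hdfs/user/$USER/s3.jceks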

Then the following distcp command format worked for me.

hadoop distcp \
-D hadoop.security.credential.provider.path=jceks://hdfs/user/$USER/s3.jceks \
/hdfs_folder/myfolder \
s3a://bucket/myfolder