Failure to launch AWS-EMR cluster in Frankfurt, but successful in N. Virginia

I am trying to launch a small EMR cluster through the AWS SDK for Java. Launching it in Frankfurt (eu-central-1) fails, but launching it in N. Virginia (us-east-1) succeeds.

My configuration, which I have verified as follows:

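  1. The instance type I'm requesting (M1Medium) exists in both regions.
  2. The Hadoop version I'm requesting for the cluster (2.7.3) is the one included in the EMR release (5.2.0).
  3. I have the appropriate IAM roles to back the cluster (the defaults, EMR_EC2_DefaultRole and EMR_DefaultRole), and they evidently work, since they are used to launch the cluster in N. Virginia.
  4. I have EC2 key pairs for both regions.
  5. I've verified that EMR is available as a service in both regions.
  6. I've verified through the EC2 dashboard in my web browser that I'm using the correct Availability Zone in each region, and that the zones are healthy.
  7. For each cluster attempt, I use an S3 bucket in the same region for the input, the output and the EMR logs.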

Here is the code that launches the cluster in Frankfurt:

public static void main(String[] args) throws Exception {
    parseArgs(args);

    if (environment.equals("local")) {
        // Local machine, single node setup. Used in order to debug the M-R logic.
        String[] p1args = {"input", "output", environment};
        Phase1.main(p1args);
    } else {
        // EMR setup. This is the main intent of this app.
        AWSCredentials credentials = null;
        try {
            credentials = new ProfileCredentialsProvider().getCredentials();
        } catch (Exception e) {
            throw new AmazonClientException(
                    "Cannot load the credentials from the credential profiles file. " +
                            "Please make sure that your credentials file is at the correct " +
                            "location (~/.aws/credentials), and is in valid format.",
                    e);
        }

        AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);

        HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
                .withJar("s3n://skill-finder-eu-central-1/jars/SkillFinder.jar")
                .withMainClass("Phase1")
                .withArgs("s3n://skill-finder-eu-central-1/input-10K", "s3n://skill-finder-eu-central-1/output-eu-central-1", environment);

        StepConfig step1Config = new StepConfig()
                .withName("Phase 1")
                .withHadoopJarStep(jarStep1)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(5)
                .withMasterInstanceType(InstanceType.M1Medium.toString())
                .withSlaveInstanceType(InstanceType.M1Medium.toString())
                .withHadoopVersion("2.7.3")
                .withEc2KeyName("AWS-EU-CENTRAL-1")
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withPlacement(new PlacementType("eu-central-1a"));

        RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
                .withName("skill-finder")
                .withInstances(instances)
                .withSteps(step1Config)
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withServiceRole("EMR_DefaultRole")
                .withReleaseLabel("emr-5.2.0")
                .withLogUri("s3n://skill-finder-eu-central-1/logs/")
                .withBootstrapActions();

        System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
        RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
        String jobFlowId = runJobFlowResult.getJobFlowId();
        System.out.println("Ran job flow with id: " + jobFlowId);
    }

}

When launching in N. Virginia, I simply replace eu-central-1 with us-east-1.
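Since the region name appears in several places (bucket URIs, key pair name, Availability Zone), one way to avoid a missed replacement is to derive all of them from a single region id. The following is only a sketch: the class `RegionSettings` and its methods are hypothetical, and it assumes every region follows the same naming as the Frankfurt resources in the code above (bucket `skill-finder-<region>`, key pair `AWS-<REGION>`).

```java
import java.util.Locale;

/**
 * Hypothetical helper: derives every region-dependent string used by the
 * launcher from one region id, following the naming seen in the code above.
 */
final class RegionSettings {
    final String region;            // e.g. "eu-central-1"
    final String availabilityZone;  // e.g. "eu-central-1a"
    final String ec2KeyName;        // e.g. "AWS-EU-CENTRAL-1"
    final String bucketUri;         // e.g. "s3n://skill-finder-eu-central-1"

    RegionSettings(String region, char zoneLetter) {
        // Accept ids shaped like "eu-central-1"; reject a zone name passed by mistake.
        if (region == null || !region.matches("[a-z]{2}(-[a-z]+)+-\\d")) {
            throw new IllegalArgumentException("Unexpected region id: " + region);
        }
        if (zoneLetter < 'a' || zoneLetter > 'z') {
            throw new IllegalArgumentException("Unexpected zone letter: " + zoneLetter);
        }
        this.region = region;
        this.availabilityZone = region + zoneLetter;
        // Assumption: key pairs are named "AWS-" plus the upper-cased region id.
        this.ec2KeyName = "AWS-" + region.toUpperCase(Locale.ROOT);
        // Assumption: one bucket per region, named "skill-finder-<region>".
        this.bucketUri = "s3n://skill-finder-" + region;
    }

    String jarUri()    { return bucketUri + "/jars/SkillFinder.jar"; }
    String inputUri()  { return bucketUri + "/input-10K"; }
    String outputUri() { return bucketUri + "/output-" + region; }
    String logUri()    { return bucketUri + "/logs/"; }
}
```

With something like `new RegionSettings("eu-central-1", 'a')`, the values would be passed to `withJar`, `withArgs`, `withEc2KeyName`, `withPlacement` and `withLogUri` in place of the hard-coded strings.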

This is the exception:

Exception in thread "main" com.amazonaws.services.elasticmapreduce.model.AmazonElasticMapReduceException: Specified Availability Zone is not supported. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: 578db9ad-b3bf-11e6-9a57-5179acb16d3f)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1545)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1183)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access0(AmazonHttpClient.java:601)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.doInvoke(AmazonElasticMapReduceClient.java:1469)
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.invoke(AmazonElasticMapReduceClient.java:1445)
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.runJobFlow(AmazonElasticMapReduceClient.java:1255)
at MRTaskLauncher.main(MRTaskLauncher.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
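The first line of the trace packs several fields into one message: a description, the service name, the HTTP status code, the error code and the request ID. A status of 400 is an HTTP client-error status, so the service rejected the request as submitted. As a minimal sketch, the fields can be pulled apart as below. `ServiceErrorMessage` is a hypothetical class, and the regular expression only assumes the message layout seen in this trace. In real code the SDK exception itself exposes these values through `getStatusCode()`, `getErrorCode()` and `getRequestId()` on `AmazonServiceException`, so parsing the string is for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the one-line service error message shown in the trace. */
final class ServiceErrorMessage {
    // Layout assumed from the trace:
    // "<description> (Service: X; Status Code: N; Error Code: Y; Request ID: Z)"
    private static final Pattern FORMAT = Pattern.compile(
            "^(.*) \\(Service: ([^;]+); Status Code: (\\d+); "
            + "Error Code: ([^;]+); Request ID: ([^)]+)\\)$");

    final String description;
    final String service;
    final int statusCode;
    final String errorCode;
    final String requestId;

    private ServiceErrorMessage(Matcher m) {
        this.description = m.group(1);
        this.service = m.group(2);
        this.statusCode = Integer.parseInt(m.group(3));
        this.errorCode = m.group(4);
        this.requestId = m.group(5);
    }

    static ServiceErrorMessage parse(String message) {
        Matcher m = FORMAT.matcher(message);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized message layout: " + message);
        }
        return new ServiceErrorMessage(m);
    }

    /** True for HTTP 4xx, meaning the request itself was rejected. */
    boolean isClientError() {
        return statusCode >= 400 && statusCode < 500;
    }
}
```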

OK, found the solution: I launched the cluster with M3Xlarge instances instead of M1Medium. Works like a charm!

How I got there:

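  1. Since I had managed to launch the cluster in N. Virginia with EMR's default IAM roles, I started to suspect a problem with authentication. This suspicion grew when I managed to launch a cluster in Frankfurt through the CLI (I found the example under Create and Use IAM Roles with the AWS CLI).
  2. Next, I tried launching the cluster through the SDK again. The cluster failed, but I copied its launch command so that I could run it through the CLI. To do that, I clicked the cluster in the EMR cluster list (web interface), clicked View cluster details, and then clicked the AWS CLI export button in the top row.
  3. To my surprise, the CLI gave a more specific error message than the web interface, which only listed a validation error, and it pointed to the instance type as the culprit! I then looked up which instance types are available in Frankfurt and chose one that does not require a VPC (M4 does), because I did not have the energy to start messing with that.
  4. A bit of a prelude: the listed validation error led me to this question. It was that question that prompted me to look into the default IAM roles and to try the CLI.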
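Because the SDK error names the Availability Zone rather than the instance type, a small check before submitting the request could fail with a clearer message. This is a minimal sketch, not an authoritative availability list: `InstanceTypePreflight` is a hypothetical class, and its table records only what this post observed (M1Medium launched in us-east-1, M3Xlarge launched in eu-central-1). The string forms `m1.medium` and `m3.xlarge` are assumed to be what `InstanceType.M1Medium.toString()` and `InstanceType.M3Xlarge.toString()` return.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical pre-flight check run before building the RunJobFlowRequest. */
final class InstanceTypePreflight {
    // Region id -> instance types observed to launch there in this post.
    // Illustrative only; extend it from the official per-region availability list.
    private static final Map<String, Set<String>> OBSERVED_TO_WORK = new HashMap<>();
    static {
        OBSERVED_TO_WORK.put("us-east-1", new HashSet<>(Arrays.asList("m1.medium")));
        OBSERVED_TO_WORK.put("eu-central-1", new HashSet<>(Arrays.asList("m3.xlarge")));
    }

    /** Returns true only if the pair is on the list above. */
    static boolean isObservedToWork(String region, String instanceType) {
        Set<String> types = OBSERVED_TO_WORK.get(region);
        return types != null && types.contains(instanceType);
    }

    /** Fails fast with a message that names the instance type and the region. */
    static void require(String region, String instanceType) {
        if (!isObservedToWork(region, instanceType)) {
            throw new IllegalArgumentException(
                    "Instance type " + instanceType
                    + " is not on the list of types known to launch in " + region);
        }
    }
}
```

Calling `InstanceTypePreflight.require("eu-central-1", "m1.medium")` before `runJobFlow` would then stop the launcher with an error about the instance type instead of the Availability Zone.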