AWS Boto3 EMR Software Settings Configuration From S3

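When you create a new AWS EMR cluster through the AWS Management Console, you can supply a JSON software configuration. You can also put the JSON file in an S3 bucket and point the software configuration at that S3 location through the console's software settings field.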

I need to do the same thing through the AWS Python SDK, Boto3, but I can't see where to do it among the available fields in their example:

response = client.run_job_flow(
    Name='string',
    LogUri='string',
    AdditionalInfo='string',
    AmiVersion='string',
    ReleaseLabel='string',
    Instances={
        'MasterInstanceType': 'string',
        'SlaveInstanceType': 'string',
        'InstanceCount': 123,
        'InstanceGroups': [
            {
                'Name': 'string',
                'Market': 'ON_DEMAND'|'SPOT',
                'InstanceRole': 'MASTER'|'CORE'|'TASK',
                'BidPrice': 'string',
                'InstanceType': 'string',
                'InstanceCount': 123,
                'Configurations': [
                    {
                        'Classification': 'string',
                        'Configurations': {'... recursive ...'},
                        'Properties': {
                            'string': 'string'
                        }
                    },
                ],
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeType': 'string',
                                'Iops': 123,
                                'SizeInGB': 123
                            },
                            'VolumesPerInstance': 123
                        },
                    ],
                    'EbsOptimized': True|False
                },
                'AutoScalingPolicy': {
                    'Constraints': {
                        'MinCapacity': 123,
                        'MaxCapacity': 123
                    },
                    'Rules': [
                        {
                            'Name': 'string',
                            'Description': 'string',
                            'Action': {
                                'Market': 'ON_DEMAND'|'SPOT',
                                'SimpleScalingPolicyConfiguration': {
                                    'AdjustmentType': 'CHANGE_IN_CAPACITY'|'PERCENT_CHANGE_IN_CAPACITY'|'EXACT_CAPACITY',
                                    'ScalingAdjustment': 123,
                                    'CoolDown': 123
                                }
                            },
                            'Trigger': {
                                'CloudWatchAlarmDefinition': {
                                    'ComparisonOperator': 'GREATER_THAN_OR_EQUAL'|'GREATER_THAN'|'LESS_THAN'|'LESS_THAN_OR_EQUAL',
                                    'EvaluationPeriods': 123,
                                    'MetricName': 'string',
                                    'Namespace': 'string',
                                    'Period': 123,
                                    'Statistic': 'SAMPLE_COUNT'|'AVERAGE'|'SUM'|'MINIMUM'|'MAXIMUM',
                                    'Threshold': 123.0,
                                    'Unit': 'NONE'|'SECONDS'|'MICRO_SECONDS'|'MILLI_SECONDS'|'BYTES'|'KILO_BYTES'|'MEGA_BYTES'|'GIGA_BYTES'|'TERA_BYTES'|'BITS'|'KILO_BITS'|'MEGA_BITS'|'GIGA_BITS'|'TERA_BITS'|'PERCENT'|'COUNT'|'BYTES_PER_SECOND'|'KILO_BYTES_PER_SECOND'|'MEGA_BYTES_PER_SECOND'|'GIGA_BYTES_PER_SECOND'|'TERA_BYTES_PER_SECOND'|'BITS_PER_SECOND'|'KILO_BITS_PER_SECOND'|'MEGA_BITS_PER_SECOND'|'GIGA_BITS_PER_SECOND'|'TERA_BITS_PER_SECOND'|'COUNT_PER_SECOND',
                                    'Dimensions': [
                                        {
                                            'Key': 'string',
                                            'Value': 'string'
                                        },
                                    ]
                                }
                            }
                        },
                    ]
                }
            },
        ],
        'InstanceFleets': [
            {
                'Name': 'string',
                'InstanceFleetType': 'MASTER'|'CORE'|'TASK',
                'TargetOnDemandCapacity': 123,
                'TargetSpotCapacity': 123,
                'InstanceTypeConfigs': [
                    {
                        'InstanceType': 'string',
                        'WeightedCapacity': 123,
                        'BidPrice': 'string',
                        'BidPriceAsPercentageOfOnDemandPrice': 123.0,
                        'EbsConfiguration': {
                            'EbsBlockDeviceConfigs': [
                                {
                                    'VolumeSpecification': {
                                        'VolumeType': 'string',
                                        'Iops': 123,
                                        'SizeInGB': 123
                                    },
                                    'VolumesPerInstance': 123
                                },
                            ],
                            'EbsOptimized': True|False
                        },
                        'Configurations': [
                            {
                                'Classification': 'string',
                                'Configurations': {'... recursive ...'},
                                'Properties': {
                                    'string': 'string'
                                }
                            },
                        ]
                    },
                ],
                'LaunchSpecifications': {
                    'SpotSpecification': {
                        'TimeoutDurationMinutes': 123,
                        'TimeoutAction': 'SWITCH_TO_ON_DEMAND'|'TERMINATE_CLUSTER',
                        'BlockDurationMinutes': 123
                    }
                }
            },
        ],
        'Ec2KeyName': 'string',
        'Placement': {
            'AvailabilityZone': 'string',
            'AvailabilityZones': [
                'string',
            ]
        },
        'KeepJobFlowAliveWhenNoSteps': True|False,
        'TerminationProtected': True|False,
        'HadoopVersion': 'string',
        'Ec2SubnetId': 'string',
        'Ec2SubnetIds': [
            'string',
        ],
        'EmrManagedMasterSecurityGroup': 'string',
        'EmrManagedSlaveSecurityGroup': 'string',
        'ServiceAccessSecurityGroup': 'string',
        'AdditionalMasterSecurityGroups': [
            'string',
        ],
        'AdditionalSlaveSecurityGroups': [
            'string',
        ]
    },
    Steps=[
        {
            'Name': 'string',
            'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
            'HadoopJarStep': {
                'Properties': [
                    {
                        'Key': 'string',
                        'Value': 'string'
                    },
                ],
                'Jar': 'string',
                'MainClass': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    BootstrapActions=[
        {
            'Name': 'string',
            'ScriptBootstrapAction': {
                'Path': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    SupportedProducts=[
        'string',
    ],
    NewSupportedProducts=[
        {
            'Name': 'string',
            'Args': [
                'string',
            ]
        },
    ],
    Applications=[
        {
            'Name': 'string',
            'Version': 'string',
            'Args': [
                'string',
            ],
            'AdditionalInfo': {
                'string': 'string'
            }
        },
    ],
    Configurations=[
        {
            'Classification': 'string',
            'Configurations': {'... recursive ...'},
            'Properties': {
                'string': 'string'
            }
        },
    ],
    VisibleToAllUsers=True|False,
    JobFlowRole='string',
    ServiceRole='string',
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ],
    SecurityConfiguration='string',
    AutoScalingRole='string',
    ScaleDownBehavior='TERMINATE_AT_INSTANCE_HOUR'|'TERMINATE_AT_TASK_COMPLETION',
    CustomAmiId='string',
    EbsRootVolumeSize=123,
    RepoUpgradeOnBoot='SECURITY'|'NONE',
    KerberosAttributes={
        'Realm': 'string',
        'KdcAdminPassword': 'string',
        'CrossRealmTrustPrincipalPassword': 'string',
        'ADDomainJoinUser': 'string',
        'ADDomainJoinPassword': 'string'
    }
)

How do I provide the S3 location of the software configuration JSON file when creating an EMR cluster through the Boto3 library?

The Configuring Applications - Amazon EMR documentation says:

Supplying a Configuration in the Console

To supply a configuration, you navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly (in JSON or using shorthand syntax demonstrated in shadow text) in the console or provide an Amazon S3 URI for a file with a JSON Configurations object.

This appears to be the capability you showed in your question.

The documentation then shows how to do this with the CLI:

aws emr create-cluster --use-default-roles --release-label emr-5.14.0 --instance-type m4.large --instance-count 2 --applications Name=Hive --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

This maps to the Configurations option in the JSON you showed above:

'Configurations': [
    {
        'Classification': 'string',
        'Configurations': {'... recursive ...'},
        'Properties': {
            'string': 'string'
        }
    },
]

Configurations: A configuration classification that applies when provisioning cluster instances, which can include configurations for applications and software that run on the cluster.

It would contain settings such as:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
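
As a rough illustration, the JSON above parses into a plain Python list of dicts, which is the shape the Configurations argument of run_job_flow takes. The names CONFIG_JSON and check_configurations below are made up for this sketch, and the checks are only a light sanity test of my own, not rules enforced by AWS.

```python
import json

# The same settings as above, as raw text (for example, the body of myConfig.json).
CONFIG_JSON = """
[
  {"Classification": "core-site",
   "Properties": {"hadoop.security.groups.cache.secs": "250"}},
  {"Classification": "mapred-site",
   "Properties": {"mapred.tasktracker.map.tasks.maximum": "2",
                  "mapreduce.map.sort.spill.percent": "0.90",
                  "mapreduce.tasktracker.reduce.tasks.maximum": "5"}}
]
"""

# json.loads turns the JSON array into a list of dicts.
configurations = json.loads(CONFIG_JSON)


def check_configurations(configs):
    """Hypothetical helper: verify the list looks like Classification/Properties entries."""
    if not isinstance(configs, list):
        raise TypeError("Configurations must be a list")
    for entry in configs:
        if "Classification" not in entry:
            raise ValueError("entry is missing 'Classification'")
        for key, value in entry.get("Properties", {}).items():
            # Property values are strings in the documented shape.
            if not isinstance(key, str) or not isinstance(value, str):
                raise ValueError("Properties must map strings to strings")
        # Nested configurations use the same structure, so recurse.
        check_configurations(entry.get("Configurations", []))
    return configs
```

The checked list could then be passed as Configurations=configurations in the run_job_flow call.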

Short answer: Configurations

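Currently the boto3 SDK cannot import configuration settings directly from S3 as part of the run_job_flow() call. You need to set up an S3 client in boto3, download the file as an S3 object, and then set the Configurations list in the EMR request dictionary from the JSON data in that file.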

An example of how to download a JSON file from S3 and load it into memory as a Python dict can be found here - Reading an JSON file from S3 using Python boto3
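
A minimal sketch of that approach follows, under some assumptions: the helper names (split_s3_url, load_configurations, launch_cluster) and the cluster name are invented for illustration, the instance settings simply mirror the CLI example, and the role names assume the default EMR roles already exist in the account. The URL helper only handles path-style URLs and s3:// URIs, not virtual-hosted-style URLs.

```python
import json
from urllib.parse import urlparse


def split_s3_url(url):
    """Split https://s3.amazonaws.com/bucket/key or s3://bucket/key into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return parsed.netloc, parsed.path.lstrip("/")
    # Path-style URL: the first path segment is the bucket, the rest is the key.
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    return bucket, key


def load_configurations(s3_client, bucket, key):
    """Download the JSON object and return it as a Python list of dicts."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read().decode("utf-8"))


def launch_cluster(config_url):
    """Create a cluster whose Configurations come from a JSON file in S3."""
    import boto3  # imported here so the helpers above can be used without AWS access

    bucket, key = split_s3_url(config_url)
    configurations = load_configurations(boto3.client("s3"), bucket, key)

    response = boto3.client("emr").run_job_flow(
        Name="my-cluster",  # hypothetical cluster name
        ReleaseLabel="emr-5.14.0",
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 2,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        Applications=[{"Name": "Hive"}],
        Configurations=configurations,  # the list loaded from S3
        JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default roles exist
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

Calling launch_cluster("https://s3.amazonaws.com/mybucket/myfolder/myConfig.json") would then start a cluster roughly equivalent to the CLI example, provided credentials and a default region are configured.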