Hive Vertex failed, vertexName=Map 2 while running on EMR on big files
I am running my Hive query on an EMR cluster, a 25-node cluster using r4.4xlarge instances.
When I run my query, I get the following error:
Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: FEAF40B78D086BEE; S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=), S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=)'
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/var/lib/hadoop/steps/s-10YQZ5Z5PRUVJ/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-east-1.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f" "s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-22_App/d77a6a82-26f4-4f06-a1ea-e83677256a55/01/DeltaOutPut/processing/Scripts/script.sql" (RuntimeError)
Command exiting with ret '1'
I have tried setting all kinds of combinations of HIVE parameters, as below:
emrfs-site fs.s3.consistent.retryPolicyType exponential
emrfs-site fs.s3.consistent.metadata.tableName EmrFSMetadataAlt
emrfs-site fs.s3.consistent.metadata.write.capacity 300
emrfs-site fs.s3.consistent.metadata.read.capacity 600
emrfs-site fs.s3.consistent true
hive-site hive.exec.stagingdir .hive-staging
hive-site hive.tez.java.opts -Xmx47364m
hive-site hive.stats.fetch.column.stats true
hive-site hive.stats.fetch.partition.stats true
hive-site hive.vectorized.execution.enabled false
hive-site hive.vectorized.execution.reduce.enabled false
hive-site tez.am.resource.memory.mb 15000
hive-site hive.auto.convert.join false
hive-site hive.compute.query.using.stats true
hive-site hive.cbo.enable true
hive-site tez.task.resource.memory.mb 16000
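For reference, settings like the ones listed above are normally applied through EMR's configurations API when the cluster is created, grouped by classification. A sketch (only a subset of the properties above, values copied from the list; not a recommendation of these particular values):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "true",
      "fs.s3.consistent.retryPolicyType": "exponential",
      "fs.s3.consistent.metadata.read.capacity": "600",
      "fs.s3.consistent.metadata.write.capacity": "300"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.vectorized.execution.enabled": "false",
      "tez.am.resource.memory.mb": "15000"
    }
  }
]
```

Such a JSON file can be passed to `aws emr create-cluster --configurations file://config.json`.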
But it failed every time.
I tried increasing the number of nodes / using bigger instances in the EMR cluster, but the result was still the same.
I also tried with and without Tez, but it still did not work for me.
Here is my sample query (I am only copying part of my query):
insert into filediffPcfp.TableDelta
Select rgt.FILLER1,rgt.DUNSNUMBER,rgt.BUSINESSNAME,rgt.TRADESTYLENAME,rgt.REGISTEREDADDRESSINDICATOR
Please help me identify the problem.
Adding the complete YARN log:
2019-02-26 06:28:54,318 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to temp file: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_task_tmp.-ext-10000/_tmp.000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: New Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,681 [INFO] [TezChild] |exec.FileSinkOperator|: FS[11]: records written - 1
2019-02-26 06:28:54,877 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000
2019-02-26 06:28:56,632 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 10000
2019-02-26 06:29:13,301 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 100000
2019-02-26 06:31:59,207 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000000
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1551161362408_0001_7_01_000000_1 due to an invocation of shutdownRequested
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Received abort
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Forwarding abort to RecordProcessor
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.MapRecordProcessor|: Forwarding abort to mapOp: {} MAP
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |exec.MapOperator|: Received abort in operator: MAP
2019-02-26 06:34:42,705 [INFO] [TezChild] |s3.S3FSInputStream|: Encountered exception while reading '2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/IncrFile/WB.ACTIVE.OCT17_01_OF_10.gz', will retry by attempting to reopen stream.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at com.amazon.ws.emr.hadoop.fs.s3n.InputStreamWithInfo.read(InputStreamWithInfo.java:173)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:136)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
Switch from Tez mode to MR. It should start working. Also remove all the Tez-related properties.
set hive.execution.engine=mr;
Let me answer my own question.
The first very important thing we noticed while running HIVE jobs on EMR is that the STEP error, the misleading vertex failure, does not point in the right direction.
So it is better to check the Hive logs.
Now, if the instances have been terminated, we cannot log in to the master instance and look at the logs; in that case we have to look for the node application logs.
We can find the node logs as follows.
Get the master instance ID, something like (i-04d04d9a8f7d28fd1), and search for it among the nodes.
Then open the path below:
/applications/hive/user/hive/hive.log.gz
Here you can find the expected error.
We also have to look at the container logs of the failed nodes; the details of the failed nodes can be found under the master instance node:
hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/
These daemon node logs are only available while the cluster is running; after the cluster is terminated, EMR does not push them to the S3 log URI.
When I looked into them, I understood the real reason for the failure.
For me, this was the cause:
While checking the instance-controller logs of the master instance, I saw that multiple core instances had gone into an unhealthy state:
2019-02-27 07:50:03,905 INFO Poller: InstanceJointStatusMap contains 21 entries (R:21):
i-0131b7a6abd0fb8e7 1541s R 1500s ig-28 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com I: 18s Y:U 81s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-01672279d170dafd3 1539s R 1500s ig-28 ip-10-97-54-69.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 0.7%
i-0227ac0f0932bd0b3 1539s R 1500s ig-28 ip-10-97-51-197.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 4.1%
i-02355f335c190be40 1544s R 1500s ig-28 ip-10-97-52-150.tr-fr-nonprod.aws-int.thomsonreuters.com I: 22s Y:R 84s c: 0 am:241664 H:R 0.2%
i-024ed22b6affdd5ec 1540s R 1500s ig-28 ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:U 79s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
Also, after a while, YARN blacklisted the core instances:
2019-02-27 07:46:39,676 INFO Poller: Determining health status for App Monitor: aws157.instancecontroller.apphealth.monitor.YarnMonitor
2019-02-27 07:46:39,688 INFO Poller: SlaveRecord i-0ac26bd7886fec338 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: SlaveRecord i-0131b7a6abd0fb8e7 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: Update SlaveRecordDbRow for i-0131b7a6abd0fb8e7 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com
2019-02-27 07:47:13,696 INFO Poller: SlaveRecord i-024ed22b6affdd5ec changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,696 INFO Poller: Update SlaveRecordDbRow for i-024ed22b6affdd5ec ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com
While checking the instance-controller logs on the instance nodes, I could see that /mnt was getting full because of the job cache, and usage crossed the threshold, which is 90% by default.
Because of this, YARN marked the disks as bad:
2019-02-27 07:40:52,231 INFO dsm-1: /mnt total 27633 MB free 2068 MB used 25565 MB
2019-02-27 07:40:52,231 INFO dsm-1: / total 100663 MB free 97932 MB used 2731 MB
2019-02-27 07:40:52,231 INFO dsm-1: cycle 17 /mnt/var/log freeSpaceMb: 2068/27633 MB freeRatio:0.07
2019-02-27 07:40:52,248 INFO dsm-1: /mnt/var/log stats :
-> In my dataset, the source table is .gz compressed. Since .gz compressed files are non-splittable, one map task is assigned per file. And since the map tasks decompress the files under /mnt, that can also contribute to this problem.
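To illustrate the non-splittable point: a single .gz file is one gzip stream, so Hadoop reads it with exactly one map task. A common workaround (a sketch, assuming GNU coreutils and a line-oriented input file; `source.txt` is a stand-in for the real data) is to split the source into several smaller parts and gzip each part before uploading, so each part gets its own mapper:

```shell
# Build a sample line-oriented input file (stand-in for the real source data).
seq 1 100000 > source.txt

# A single gzip of the whole file would be read by exactly one map task.
gzip -k source.txt            # produces source.txt.gz (-k keeps the original)

# Instead, split into 4 line-aligned chunks and compress each one;
# each resulting .gz part can then be assigned to its own map task.
split -n l/4 source.txt part_
for f in part_a?; do gzip "$f"; done

ls part_*.gz                  # part_aa.gz part_ab.gz part_ac.gz part_ad.gz
```

`split -n l/4` keeps lines intact, so concatenating the decompressed parts in order reproduces the original file exactly.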
-> Processing a large amount of data in EMR requires tuning some Hive properties. Below are a few optimization properties that can be set in the cluster to make queries run better.
V.V.V.V.V.I
Increase the EBS volume size for Core instances
The important thing is that we have to increase the EBS volume of every core node, not just the master node, because the EBS volumes are where /mnt is mounted, not the root volume.
This alone fixed my problem, but the configuration below also helped me optimize the HIVE jobs:
hive-site.xml
-------------
"hive.exec.compress.intermediate" : "true",
"hive.intermediate.compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
"hive.intermediate.compression.type" : "BLOCK"
yarn-site.xml
-------------
"max-disk-utilization-per-disk-percentage" : "99"
This has permanently resolved my problem.
Hope someone benefits from my answer.