hadoop mapreduce teragen FAIL_CONTAINER_CLEANUP
I'm having some trouble with my Hadoop cluster.
I tried running some benchmarks on it to check its performance and verify that MapReduce works correctly, but I'm seeing some strange behavior.
The job starts and runs its map phase, but I get errors along the way:
I first generate the data with teragen:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 500 random-data
The job then starts, and I get several task failures without the process stopping:
17/02/23 12:29:27 INFO client.RMProxy: Connecting to ResourceManager at /172.16.138.145:8032
17/02/23 12:29:28 INFO terasort.TeraSort: Generating 500 using 2
17/02/23 12:29:28 INFO mapreduce.JobSubmitter: number of splits:2
17/02/23 12:29:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1487846108320_0007
17/02/23 12:29:28 INFO impl.YarnClientImpl: Submitted application application_1487846108320_0007
17/02/23 12:29:28 INFO mapreduce.Job: The url to track the job: http://172.16.138.145:8088/proxy/application_1487846108320_0007/
17/02/23 12:29:28 INFO mapreduce.Job: Running job: job_1487846108320_0007
17/02/23 12:29:34 INFO mapreduce.Job: Job job_1487846108320_0007 running in uber mode : false
17/02/23 12:29:34 INFO mapreduce.Job: map 0% reduce 0%
17/02/23 12:29:47 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_0, Status : FAILED
17/02/23 12:29:48 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_0, Status : FAILED
17/02/23 12:30:02 INFO mapreduce.Job: map 50% reduce 0%
17/02/23 12:30:02 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_1, Status : FAILED
17/02/23 12:30:03 INFO mapreduce.Job: map 0% reduce 0%
17/02/23 12:30:03 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_1, Status : FAILED
17/02/23 12:30:15 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_2, Status : FAILED
17/02/23 12:30:16 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_2, Status : FAILED
17/02/23 12:30:30 INFO mapreduce.Job: map 100% reduce 0%
17/02/23 12:30:31 INFO mapreduce.Job: Job job_1487846108320_0007 failed with state FAILED due to: Task failed task_1487846108320_0007_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
I checked the logs on the datanode involved, and the following lines are repeated for every failure:
2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1487846108320_0001_m_000001_1:
2017-02-23 11:36:12,902 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1487846108320_0001_01_000004 taskAttempt attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : Datanode3:34121
2017-02-23 11:36:12,923 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
2017-02-23 11:36:12,924 INFO [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT
2017-02-23 11:36:12,932 WARN [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://172.16.138.145:9000/user/hdfs/random-dataSmallV7.7/_temporary/1/_temporary/attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,932 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
In this case the job failed, but sometimes I get the same errors and the job still succeeds (rarely).
Do you know what causes this FAIL_CONTAINER_CLEANUP, or what the underlying problem might be?
Only mappers are used here (no reducers requested), but the same error also appears in other jobs that do involve reducers.
Thanks in advance for your ideas.
I finally solved it.
My /etc/hosts file contained a line that mapped the node's hostname to the loopback address:
127.0.1.1 Datanode1
I replaced that line with the machine's real IP address and its hostname:
172.16.138.147 Datanode1
This lets Hadoop resolve a real, reachable address for the server, which fixed the error.
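As a quick sanity check on other nodes, you can verify that a hostname resolves to a routable address rather than a loopback one. This is a minimal Python sketch (not part of Hadoop; the helper name is my own):

```python
import socket

def resolves_to_loopback(hostname):
    """Return True if the hostname resolves to a 127.x.x.x loopback address.

    Cluster hostnames (e.g. Datanode1) should resolve to their real network
    IP; a loopback answer means /etc/hosts still has a 127.0.1.1 entry and
    containers on other nodes will not be able to reach this one.
    """
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        # Name does not resolve at all; that is a different problem,
        # but it is at least not a loopback mapping.
        return False
    return ip.startswith("127.")

# "localhost" is expected to map to loopback; your datanode hostnames should not.
print(resolves_to_loopback("localhost"))
```

Run it with each cluster hostname; any node name that comes back True still has the bad /etc/hosts mapping.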
Hope this helps someone else.