Too many fetch failures

I have set up a 2-node Hadoop cluster on Ubuntu 12.04 with Hadoop 1.2.1. When I try to run the Hadoop wordcount example, I keep getting the "Too many fetch failures" error. I have read many articles, but I cannot figure out what the entries in the masters, slaves, and /etc/hosts files should be. My node names are "master" with IP 10.0.0.1 and "slaveone" with IP 10.0.0.2.

I need help: what should the entries in the masters and slaves files and in /etc/hosts be on the master and slave nodes?
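For reference, here is a minimal sketch of what those files could look like for the two nodes described in the question (hostnames and IPs are taken from the question; the exact layout is an assumption, so adjust it to your own setup):

    # /etc/hosts on BOTH nodes -- the hostnames are NOT mapped to 127.0.0.1
    127.0.0.1   localhost
    10.0.0.1    master
    10.0.0.2    slaveone

    # conf/masters on the master node (in Hadoop 1.x this file controls where
    # the SecondaryNameNode is started)
    master

    # conf/slaves on the master node (DataNode/TaskTracker hosts; listing
    # master here is optional and only needed if it should also run them)
    master
    slaveone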

If you cannot upgrade your cluster for any reason, you can try the following:

  1. Ensure that your hostname is bound to the network IP and NOT to 127.0.0.1 in /etc/hosts (as in the /etc/hosts example above).
  2. Ensure that you refer to services only by hostname, not by IP.
  3. If the above are set correctly, try the following settings:

set mapred.reduce.slowstart.completed.maps=0.80
set tasktracker.http.threads=80
set mapred.reduce.parallel.copies=10    (any value >= 10; 10 should probably be sufficient)
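If you want to make these settings permanent rather than per-job, one possible place is conf/mapred-site.xml. This is only a sketch using the Hadoop 1.x property names quoted above; note that tasktracker.http.threads is a daemon-side setting, so the TaskTrackers need to be restarted for it to take effect:

    <!-- conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.reduce.slowstart.completed.maps</name>
        <value>0.80</value>
      </property>
      <property>
        <name>tasktracker.http.threads</name>
        <value>80</value>
      </property>
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <value>10</value>
      </property>
    </configuration>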

Also see this SO post: Why I am getting "Too many fetch-failures" every other day

And this one: Too many fetch failures: Hadoop on cluster (x2)

If the above does not help, there is also this: http://grokbase.com/t/hadoop/common-user/098k7y5t4n/how-to-deal-with-too-many-fetch-failures For brevity and to save time, I am putting the part I found most relevant here.

The number 1 cause of this is something that causes a connection to get a map output to fail. I have seen:

  1. firewall
  2. misconfigured ip addresses (i.e. the tasktracker attempting the fetch received an incorrect ip address when it looked up the name of the tasktracker with the map segment)
  3. rare: the http server on the serving tasktracker is overloaded due to insufficient threads or listen backlog; this can happen if the number of fetches per reduce is large and the number of reduces or the number of maps is very large.

There are probably other cases. This recently happened to me when I had 6000 maps and 20 reducers on a 10-node cluster, which I believe was case 3 above. Since I didn't actually need to reduce (I got my summary data via counters in the map phase), I never re-tuned the cluster.
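To rule out cause 2 on your own cluster, it can help to check what the hostnames actually resolve to on each node. These are standard Linux commands, not part of the quoted thread; neither name should come back as 127.0.0.1 (or Ubuntu's default 127.0.1.1):

    # run on both master and slaveone
    hostname                        # the node's own hostname
    hostname -i                     # the IP it resolves to (should be 10.0.0.x)
    getent hosts master slaveone    # how the resolver sees both names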

Edit: the original answer said "Ensure that your hostname is bound to the network IP and 127.0.0.1 in /etc/hosts"