Giraph 应用程序卡住,在 superstep 4,所有工作人员都处于活动状态但没有进展

Giraph application get stuck, on superstep 4, all workers active but without progress

我正在通过维基百科(西班牙语版)网站进行 BFS 搜索。我将转储 (https://dumps.wikimedia.org/eswiki/20160601) 转换为可以使用 Giraph 读取的文件。

BFS正在搜索路径,一切正常,直到卡在第四步的某个点。

我在 AWS 上使用 5 个节点的集群(4 个从核心,1 个主节点)。每个节点都是一个 r3.8xlarge ec2 实例。执行BFS的命令是这个:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug

每个容器有120GB(差不多)。我在 outOfCore 中使用 1000M 消息限制,因为我认为这是问题所在,但显然不是。

这些是主日志(看起来是在等待工人完成但他们就是不......并且永远这样......):

6/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
...thirty times same last two lines...
...
6/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4

并且在所有工作人员中,没有关于正在发生的事情的信息(我正在使用 giraph.logLevel=Debug 进行测试,因为 giraph 日志的默认级别我迷路了),工人们一遍又一遍地说:

16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82

在开始superstep 4之前,每个worker的信息如下

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID
=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M

我不知道究竟是什么失败了:

正如你所怀疑的,我的假设太多了,这就是我寻求帮助的原因,所以我可以朝着正确的方向前进。

编辑 1:使用 giraph.maxNumberOfOpenRequests 和 giraph.waitForRequestsConfirmation=true 没有解决问题

编辑 2: 我复制了 netty 线程,并将原始大小的两倍分配给 netty 缓冲区,没有任何变化。

编辑 3:我将 1000 条消息压缩为 1 条,得到的消息少了很多,但最终结果仍然相同。

我使用 15 个计算线程和 240 个分区来隔离问题。

我可以观察到一个分区和一个线程花费的时间太长。

所以我检查了代码,寻找可能需要那么长时间的东西。我发现了,我使用 + 而不是 StringBuilder 连接字符串(我相信编译器将“+”切换为 StringBuilder.append(),正如有人建议的那样 here)。但事实并非如此。时间急剧减少,应用程序终于完成。