giraph.numInputThreads: execution time for the "input superstep" is the same using 1 or 8 threads. How is this possible?

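I'm running a BFS over the Spanish-language Wikipedia. I converted the dump into a file that Giraph can read.

With 1 worker, loading a 1 GB file takes 452 seconds (the input superstep). I ran Giraph with this command: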

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true

Container logs:

16/08/24 21:17:02 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 1 input threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to start superstep -1
16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without authentication.
16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses: Successfully added 1 connections, (1 total connected) 0 failed, 0 failures total.
16/08/24 21:17:02 INFO partition.PartitionUtils: computePartitionCount: Creating 1, default would have been 1 partitions.
...
16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing resources now.
16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing resources
16/08/24 21:25:44 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
16/08/24 21:25:47 INFO bsp.BspService: process: masterElectionChildrenChanged signaled
16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds.
16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took 452.531 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591 seconds.
16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609 seconds.
16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 seconds.

As you can see, the first line shows that the input superstep runs with only one thread, and it took 452 seconds to finish.

I ran another test with giraph.numInputThreads=8, trying to run the input superstep with 8 threads:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=8

The result is as follows:

16/08/24 21:54:00 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 8 input threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
...

16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds.
16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took 891.339 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837 seconds.
16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605 seconds.
16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 seconds.

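So my question is: how can Giraph take 452 seconds for the input superstep with the default single input thread and 891 seconds with 8 input threads? It should be the other way around, right?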

The cluster used for this is 1 master and 1 slave, both r3.8xlarge EC2 instances on AWS.

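The problem is related to HDFS access. All 8 threads are reading the same resource, and that resource can only be accessed sequentially, so the threads contend with each other instead of speeding things up. For best performance here, giraph.numInputThreads should be 1 or 2.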
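
To check which thread count actually works best on a given cluster, it helps to compare the "input superstep" time across runs rather than the total. Below is a minimal Python sketch, assuming the MasterThread timing lines keep the format shown in the logs above; the helper name `phase_times` is hypothetical and is not part of Giraph.

```python
import re

# Matches MasterThread timing lines such as
# "... INFO master.MasterThread: input superstep: Took 452.531 seconds."
# Assumption: the log format is the one shown in the question.
TIMING = re.compile(r"master\.MasterThread: (.+?): Took ([0-9.]+) seconds\.")


def phase_times(log_text):
    """Return {phase name: seconds} for every MasterThread timing line."""
    return {name: float(secs) for name, secs in TIMING.findall(log_text)}


# Timing lines copied from the two runs in the question.
run_1_thread = """\
16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds.
16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took 452.531 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 seconds.
16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 seconds.
"""

run_8_threads = """\
16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds.
16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took 891.339 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 seconds.
16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 seconds.
"""

t1 = phase_times(run_1_thread)["input superstep"]
t8 = phase_times(run_8_threads)["input superstep"]

# A ratio above 1 means the 8-thread run was slower at loading the graph.
slowdown = t8 / t1
print(f"input superstep with 8 threads vs 1 thread: {slowdown:.2f}x")
```

Running the same comparison for giraph.numInputThreads=2 (and other values) on your own logs would show whether the recommendation holds for your storage setup; the compute supersteps barely change between the two runs above, so the input superstep is the number to watch.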