Trouble running Apache Giraph on YARN cluster (Hadoop 2.5.2)

Question

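I'm trying to run the basic ShortestPaths example using Giraph 1.1 on Hadoop 2.5.2. I'm running on an actual cluster (i.e., not pseudo-distributed), and I can run standard MapReduce jobs. But when I try to run the Giraph example, it appears to hang unless I set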

-ca giraph.SplitMasterWorker=false

and set the number of workers to 1 accordingly. But that rather defeats the point of running on a cluster, doesn't it? OTOH, if I run without disabling SplitMasterWorker, I get this error:

When using LocalJobRunner, you cannot run in split master / worker mode 
since there is only 1 task at a time!

This suggests that Giraph is defaulting to local mode. One report I read suggested working around this by adding

-ca mapred.job.tracker=10.0.0.12:5431

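to the Giraph command line, but on Hadoop 2.5.2 with YARN, if I understand correctly, there is no JobTracker on port 5431. In any case, if I do add that bit, the job tries to run but seems to hang without ever finishing. Here is the full command line, followed by the job output: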

[prhodes@ip-10-0-0-12 conf]$ hadoop jar /home/prhodes/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.5.2-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/prhodes/input/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/prhodes/giraph_output/shortestpaths \
    -w 3 \
    -ca mapred.job.tracker=10.0.0.12:5431




15/03/10 03:18:59 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/03/10 03:19:02 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:22181
15/03/10 03:19:02 INFO server.PrepRequestProcessor: zookeeper.skipACL=="yes", ACL checks will be skipped
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Connect attempt 1 of 10 max trying to connect to ip-10-0-0-12.ec2.internal:22181 with poll msecs = 3000
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Connected to ip-10-0-0-12.ec2.internal/10.0.0.12:22181!
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Creating my filestamp _bsp/_defaultZkManagerDir/job_local1346154675_0001/_zkServer/ip-10-0-0-12.ec2.internal 0
15/03/10 03:19:05 INFO server.NIOServerCnxnFactory: Accepted socket connection from /10.0.0.12:45182
15/03/10 03:19:05 INFO graph.GraphTaskManager: setup: Chosen to run ZooKeeper...
15/03/10 03:19:05 INFO graph.GraphTaskManager: setup: Starting up BspServiceMaster (master thread)...
15/03/10 03:19:05 INFO bsp.BspService: BspService: Path to create to halt is /_hadoopBsp/job_local1346154675_0001/_haltComputation
15/03/10 03:19:05 INFO bsp.BspService: BspService: Connecting to ZooKeeper with job job_local1346154675_0001, 0 on ip-10-0-0-12.ec2.internal:22181
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-10-0-0-12.ec2.internal/10.0.0.12:22181. Will not attempt to authenticate using SASL (unknown error)
15/03/10 03:19:05 INFO server.NIOServerCnxnFactory: Accepted socket connection from /10.0.0.12:45183
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-0-0-12.ec2.internal/10.0.0.12:22181, initiating session
15/03/10 03:19:05 INFO server.ZooKeeperServer: Client attempting to establish new session at /10.0.0.12:45183
15/03/10 03:19:05 INFO persistence.FileTxnLog: Creating new log file: log.1
15/03/10 03:19:05 INFO server.ZooKeeperServer: Established session 0x14c01b158f00000 with negotiated timeout 600000 for client /10.0.0.12:45183
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-0-0-12.ec2.internal/10.0.0.12:22181, sessionid = 0x14c01b158f00000, negotiated timeout = 600000
15/03/10 03:19:05 INFO bsp.BspService: process: Asynchronous connection complete.
15/03/10 03:19:05 INFO graph.GraphTaskManager: map: No need to do anything when not a worker
15/03/10 03:19:05 INFO graph.GraphTaskManager: cleanup: Starting for MASTER_ZOOKEEPER_ONLY
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x1 zxid:0x2 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_masterElectionDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_masterElectionDir
15/03/10 03:19:05 INFO master.BspServiceMaster: becomeMaster: First child is '/_hadoopBsp/job_local1346154675_0001/_masterElectionDir/ip-10-0-0-12.ec2.internal_00000000000' and my bid is '/_hadoopBsp/job_local1346154675_0001/_masterElectionDir/ip-10-0-0-12.ec2.internal_00000000000'
15/03/10 03:19:05 INFO netty.NettyServer: NettyServer: Using execution group with 8 threads for requestFrameDecoder.
15/03/10 03:19:05 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/03/10 03:19:05 INFO netty.NettyServer: start: Started server communication server: ip-10-0-0-12.ec2.internal/10.0.0.12:30000 with up to 16 threads on bind attempt 0 with sendBufferSize = 32768 receiveBufferSize = 524288
15/03/10 03:19:05 INFO netty.NettyClient: NettyClient: Using execution handler with 8 threads after request-encoder.
15/03/10 03:19:05 INFO master.BspServiceMaster: becomeMaster: I am now the master!
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0xe zxid:0x9 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0
15/03/10 03:19:05 INFO bsp.BspService: process: applicationAttemptChanged signaled
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x16 zxid:0xc txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1
15/03/10 03:19:05 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir, type=NodeChildrenChanged, state=SyncConnected)
15/03/10 03:19:07 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:07 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer ip-10-0-0-12.ec2.internal:22181 --zkNode /_hadoopBsp/job_local1346154675_0001/_haltComputation'
15/03/10 03:19:07 INFO mapreduce.Job: Running job: job_local1346154675_0001
15/03/10 03:19:08 INFO mapreduce.Job: Job job_local1346154675_0001 running in uber mode : false
15/03/10 03:19:08 INFO mapreduce.Job:  map 25% reduce 0%
15/03/10 03:19:10 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:19 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:28 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:35 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 3 needed to start superstep -1.  Reporting every 30000 msecs, 569976 more msecs left before giving up.
15/03/10 03:19:35 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x22 zxid:0x10 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
15/03/10 03:19:35 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x23 zxid:0x11 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
15/03/10 03:19:40 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map

Answer 1

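OK, it turns out this was simple. I had built Giraph with the hadoop_2 profile instead of hadoop_yarn. Once I built it with the YARN profile, the problem went away. I don't understand the whole mechanism behind this, but apparently building with that profile changes some defaults so that the job runs in pure YARN mode at runtime.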

So, if you run into this, rebuild using

mvn -Phadoop_yarn clean package

and that may fix the problem.
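
To confirm whether a given run fell back to local mode, you can look for the same markers that show up in the log in the question: the `mapred.LocalJobRunner` logger and job IDs beginning with `job_local`. The following is only a minimal sketch, assuming the client output has been captured to a file; the function name `ran_in_local_mode` and the log file name are hypothetical, not part of Giraph or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Giraph or Hadoop): succeeds if the captured
# client output contains the local-mode markers seen in the question's log.
ran_in_local_mode() {
  # $1: path to a file holding the captured job output (assumed to exist)
  grep -q -e 'mapred\.LocalJobRunner' -e 'job_local[0-9]' "$1"
}

# Example usage (the log file name is an assumption):
#   <your hadoop jar command> 2>&1 | tee giraph_run.log
#   if ran_in_local_mode giraph_run.log; then
#     echo "Job ran under LocalJobRunner, not on YARN"
#   fi
```

If the check still succeeds after rebuilding with the hadoop_yarn profile, the job is still being submitted locally and the cause probably lies elsewhere.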
