H2O cluster startup frequently timing out

Trying to start an h2o cluster on (MapR) hadoop via python
# startup hadoop h2o cluster
import os
import re
import shlex
import subprocess
import sys  # needed for sys.exit() below

from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Communicate streaming text lines from a separate thread
    (non-blocking read of a subprocess pipe).
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)), 
                             shell=False, 
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True # thread dies with the program
t.start()

# read line without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get(timeout=0.1)  # short block instead of a busy-wait spin
    except Empty:
        continue
    else:  # got a line
        print line
        # check for first instance connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip

This frequently throws a timeout error:

Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster

Looking through the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (just doing a word search for "timeout"), I can't find anything that helps solve the problem (e.g., extending the timeout with hadoop jar h2odriver.jar -timeout <some time> just lengthens the wait before the timeout error pops up).

Have noticed that this often happens when another instance of an h2o cluster is already up and running (which I don't understand, since I thought YARN could support multiple instances), but sometimes also when no other cluster has been initialized.

Beyond the error message h2o throws, does anyone know anything else I could try to fix this, or ways to get more debugging information?


UPDATE:

Trying to reproduce the problem from the command line, I get

[me@mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 172.18.4.62]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032

----- YARN cluster metrics -----
Number of YARN worker nodes: 6

----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used,  0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used,  0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used,  2.0 / 8.7 GB used, 1 / 2 vcores used

----- Queues -----
Queue name:            root.default
    Queue state:       RUNNING
    Current capacity:  0.00
    Capacity:          0.00
    Maximum capacity:  -1.00
    Application count: 0

Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used

----------------------------------------------------------------------

WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR:   Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'

and notice the output near the end:

WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR:   Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
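The 26.4 GB figure in the WARNING follows directly from the driver's memory settings printed earlier (4 mappers, -mapperXmx 6g, and the 10% "Extra memory percent"); a quick arithmetic check:

```python
# Reproduce the driver's memory figures from the flags used above.
nodes = 4                  # -nodes 4
mapper_xmx_mb = 6 * 1024   # -mapperXmx 6g
extra_pct = 10             # "Extra memory percent: 10" in the driver output

# Per-container request: Xmx plus the extra-memory overhead.
per_container_mb = int(mapper_xmx_mb * (1 + extra_pct / 100.0))
total_request_gb = nodes * per_container_mb / 1024.0

print('mapreduce.map.memory.mb = %d' % per_container_mb)  # 6758, as logged
print('total job request = %.1f GB' % total_request_gb)   # 26.4 GB
```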

I am confused by the reported 0.0 GB of memory and 0 vcores, because no other applications are running on the cluster, and looking at the cluster details in the YARN RM web UI shows

(shown as an image because I could not find one consolidated place for this information in the log files; why the memory availability is so uneven despite no other running applications, I don't know). I should mention at this point that I don't have much experience tweaking or inspecting YARN configurations, so I'm having a hard time digging up the relevant information.

Could it be that I started the h2o cluster with -mapperXmx=6g, but (as seen in the image) one of the nodes only had 5g of memory available, so that if this node were randomly chosen to host part of the h2o application it would not have enough memory to support the requested mapper size? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping several times without error seems to support this theory (though further checking is needed to confirm I'm interpreting things correctly).
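One way to sanity-check this theory against the node listing in the output above: YARN rounds each container request up to a multiple of its allocation increment (1024 MB is assumed here, the usual default for yarn.scheduler.minimum-allocation-mb; your cluster's setting may differ). Under that assumption, the 6758 MB request becomes a 7 GB container, which fits on only three of the six nodes, matching the "Only 3 out of the requested 4 worker containers were started" error, while a 5g -mapperXmx rounds to a 6 GB container and a fourth node fits:

```python
import math

def container_mb(mapper_xmx_gb, extra_pct=10, increment_mb=1024):
    # Xmx plus the driver's extra-memory overhead, rounded up to the
    # scheduler allocation increment (1024 MB is an assumed default).
    requested = mapper_xmx_gb * 1024 * (1 + extra_pct / 100.0)
    return int(math.ceil(requested / increment_mb)) * increment_mb

# Free memory (GB) per node, i.e. capacity minus used, read off the
# "----- Nodes -----" listing above (mnode03, 05, 06, 01, 04, 02).
free_gb = [7.0, 10.4, 10.4, 5.0, 3.4, 6.7]

def nodes_that_fit(mapper_xmx_gb):
    need_gb = container_mb(mapper_xmx_gb) / 1024.0
    return sum(1 for f in free_gb if f >= need_gb)

print('-mapperXmx 6g -> %d of %d nodes fit' % (nodes_that_fit(6.0), len(free_gb)))  # 3 of 6
print('-mapperXmx 5g -> %d of %d nodes fit' % (nodes_that_fit(5.0), len(free_gb)))  # 4 of 6
```

This is only a sketch under the 1 GB-increment assumption, but it reproduces both the 3-of-4 failure with 6g and the clean startup observed with 5g.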

This is most likely happening because your Hadoop cluster is busy and there is no room to start new YARN containers.

If you request N nodes, then you either get all N nodes or the startup process times out as you are seeing. You can optionally use the -timeout command-line flag to increase the waiting time.
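For instance, in the python startup code from the question, the -timeout value can be made configurable where the driver command is assembled (600 below is just an illustrative value, not a recommendation):

```python
def build_h2o_driver_cmd(driver_dir, nodes=4, mapper_xmx='6g',
                         timeout_secs=600, output_dir='hdfsOutputDir'):
    # Assemble the same hadoop jar command used in the question,
    # with an adjustable -timeout (seconds to wait for all nodes).
    return ('/bin/hadoop jar {d}h2odriver.jar -nodes {n} -mapperXmx {x} '
            '-timeout {t} -output {o}').format(d=driver_dir, n=nodes,
                                               x=mapper_xmx, t=timeout_secs,
                                               o=output_dir)

cmd = build_h2o_driver_cmd('/home/mapr/h2o-3.18.0.2-mapr5.2/')
print(cmd)
# pass shlex.split(cmd) to subprocess.Popen exactly as before
```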