YARN job appears to have access to less resources than Ambari YARN manager reports

Question

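I am confused by an error I get when trying to run a YARN process. Looking at the YARN section of the Ambari UI, I see ... (note that it says 60GB are available). Yet when I try to run a YARN process, I get an error indicating that fewer resources are available than Ambari reports, see ...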

➜  h2o-3.26.0.2-hdp3.1 hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /home/ml1/hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46721
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms5g -Xmx5g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     5632
Hive driver not present, not generating token.
19/08/07 12:37:19 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:19 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/08/07 12:37:19 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1/.staging/job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: number of splits:4
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/08/07 12:37:21 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/08/07 12:37:21 INFO impl.YarnClientImpl: Submitted application application_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
Job name 'H2O_80092' submitted
JobTracker job ID is 'job_1565057088651_0007'
For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Waiting for H2O cluster to come up...
19/08/07 12:37:38 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:38 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 4

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://hw05.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name:            default
    Queue state:       RUNNING
    Current capacity:  0.08
    Capacity:          1.00
    Maximum capacity:  1.00
    Application count: 1
    ----- Applications in this queue -----
    Application ID:                  application_1565057088651_0007 (H2O_80092)
        Started:                     ml1 (Wed Aug 07 12:37:21 HST 2019)
        Application state:           FINISHED
        Tracking URL:                http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
        Queue name:                  default
        Used/Reserved containers:    1 / 0
        Needed/Used/Reserved memory: 5.0 GB / 5.0 GB / 0.0 GB
        Needed/Used/Reserved vcores: 1 / 1 / 0

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (5.5 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'

Note the error:

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

A common cause for this is the requested container size (5.5 GB) exceeds the following YARN settings:
  yarn.nodemanager.resource.memory-mb
  yarn.scheduler.maximum-allocation-mb

Yet I have YARN configured with

yarn.scheduler.maximum-allocation-vcores=3
yarn.nodemanager.resource.cpu-vcores=3
yarn.nodemanager.resource.memory-mb=15GB
yarn.scheduler.maximum-allocation-mb=15GB

So we can see that both the container and the node resource limits are higher than the requested container size.
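As a quick sanity check of that claim, the arithmetic can be sketched in Python. The numbers come from the driver output and the settings above; the function and variable names are made up for illustration, and the "15GB" settings are assumed to mean 15 * 1024 MB.

```python
# Sanity-check sketch: does the requested H2O container fit the YARN limits?
# All names are illustrative; the values come from the log and settings above.

MB_PER_GB = 1024


def container_size_mb(mapper_xmx_gb: int, extra_memory_percent: int) -> int:
    """Container size = JVM heap plus the driver's extra-memory percentage."""
    heap_mb = mapper_xmx_gb * MB_PER_GB
    return heap_mb + heap_mb * extra_memory_percent // 100


# -mapperXmx 5g with "Extra memory percent: 10"
requested_mb = container_size_mb(mapper_xmx_gb=5, extra_memory_percent=10)

# Assumption: "15GB" in the settings means 15 * 1024 MB.
node_limit_mb = 15 * MB_PER_GB   # yarn.nodemanager.resource.memory-mb
max_alloc_mb = 15 * MB_PER_GB    # yarn.scheduler.maximum-allocation-mb

cluster_mb = 4 * node_limit_mb          # four worker nodes
total_requested_mb = 4 * requested_mb   # -nodes 4

fits_per_container = requested_mb <= min(node_limit_mb, max_alloc_mb)
fits_in_cluster = total_requested_mb <= cluster_mb

print(requested_mb)         # 5632, matches mapreduce.map.memory.mb in the log
print(cluster_mb)           # 61440, matches the RM UI total
print(total_requested_mb)   # 22528
print(fits_per_container, fits_in_cluster)  # True True
```

Both checks pass under these assumptions, which suggests the memory limits named in the error message are not the actual cause here.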

Trying a heavier computation with the default mapreduce pi example

[myuser@HW03 ~]$ yarn jar /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1000 1000
Number of Maps  = 1000
Samples per Map = 1000
....

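and checking the RM UI, I can see that it is, at least in some cases, possible to use all 60GB of the RM's resources (notice the 61440MB at the bottom of the image).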

So there are some things about this problem that I don't understand:

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

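I would like to use the full 60GB that YARN can ostensibly provide (or at least have the option to, rather than having errors thrown). I would think there should be enough resources for each of the 4 nodes to provide 15GB to the process (more than the requested 4x5GB=20GB). Am I missing something here? Note that I only have the default root queue set up for YARN.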
----- Nodes -----

Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used

Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

....

Why is only a single node being used before the error is thrown?

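From these two things, it seems that neither the 15GB node limit nor the 60GB cluster limit is being exceeded, so why are these errors being thrown? What am I misinterpreting about this situation? What can be done to fix it (again, I would like to be able to use all of the apparent 60GB of YARN resources for the job without error)? Any debugging suggestions for a fix?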

UPDATE:

The problem appears to be related to How to properly change uid for HDP / ambari-created user? and to the fact that having a user exist on a node and have a hdfs://user/<username> directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient to be acknowledged as "existing" on the cluster.

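I ran the hadoop jar command as a different user that exists on all cluster nodes and has a hdfs://user/hdfs directory (in this case, the Ambari-created hdfs user), and found that the h2o jar ran as expected. This worked even though Ambari created this user with different uids across nodes (IDK if this is a problem).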

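I was initially under the impression that a user only needed to exist on whatever client machine was being used, plus have a hdfs://user/ directory (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005). One concerning / confusing thing that has come from this is that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g. ...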

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

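This does not seem like how it is supposed to be (this is just my suspicion from having worked with MapR, which requires uids and gids to be the same across nodes, and from looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 is a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with on all the nodes, with arbitrary uid and gid values. Any thoughts on this? Can you link me to any docs that support why this is right or wrong?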
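To make the mismatch easier to spot on a larger cluster, the per-node `id` output could be grouped by (uid, gid) with a short script. This is only a sketch: the helper names are invented for illustration, and the input is the `id hdfs` output shown above, expanded by hand to one line per node rather than collected automatically.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches the leading "uid=1017(hdfs) gid=1005(hadoop)" part of `id` output.
ID_LINE = re.compile(r"uid=(\d+)\((\w+)\) gid=(\d+)\((\w+)\)")


def parse_id(line: str) -> Tuple[int, int]:
    """Return (uid, gid) parsed from one line of `id` output."""
    match = ID_LINE.search(line)
    if match is None:
        raise ValueError("unrecognized id output: " + line)
    return int(match.group(1)), int(match.group(3))


def group_nodes_by_ids(id_output_by_node: Dict[str, str]) -> Dict[Tuple[int, int], List[str]]:
    """Group node names by the (uid, gid) pair each node reports."""
    groups = defaultdict(list)
    for node, line in id_output_by_node.items():
        groups[parse_id(line)].append(node)
    return dict(groups)


# `id hdfs` output from the clush run above, one entry per node.
same_on_first_four = "uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)"
id_output_by_node = {
    "HW01": same_on_first_four,
    "HW02": same_on_first_four,
    "HW03": same_on_first_four,
    "HW04": same_on_first_four,
    "HW05": "uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)",
}

groups = group_nodes_by_ids(id_output_by_node)
consistent = len(groups) == 1  # True only if every node agrees on uid and gid

for ids, nodes in groups.items():
    print(ids, nodes)
print("consistent:", consistent)
```

With the output above this reports two groups, HW01 to HW04 on (1017, 1005) and HW05 on (1021, 1006), so the hdfs user is not consistent across nodes.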

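I will look into this a bit more before posting it as an answer. I think I basically need more clarification as to when HDP considers a user to "exist" on a cluster.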

Answer 1

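The problem appears to be related to How to properly change uid for HDP / ambari-created user? and to the fact that having a user exist on a node and have a hdfs://user/ directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient to be acknowledged as "existing" on the cluster. This matches what I heard in discussions with Hortonworks experts, who said that a user using YARN must exist on all datanodes of the cluster.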
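One way to check the "must exist on all datanodes" requirement is to compare the user list of every node. The sketch below assumes the passwd entries of each node have already been collected into a dict (for example from `getent passwd` run on each host); the function name, node names, and sample entries are all hypothetical and are not taken from the cluster described here.

```python
from typing import Dict, List


def nodes_missing_user(user: str, passwd_by_node: Dict[str, str]) -> List[str]:
    """Return the nodes whose passwd-format text has no entry for `user`."""
    missing = []
    for node, passwd_text in sorted(passwd_by_node.items()):
        # The first colon-separated field of a passwd entry is the user name.
        names = {
            line.split(":", 1)[0]
            for line in passwd_text.splitlines()
            if line.strip()
        }
        if user not in names:
            missing.append(node)
    return missing


# Hypothetical sample data: "exampleuser" exists on node-a but not on node-b.
passwd_by_node = {
    "node-a": (
        "root:x:0:0:root:/root:/bin/bash\n"
        "exampleuser:x:1001:1001::/home/exampleuser:/bin/bash\n"
    ),
    "node-b": "root:x:0:0:root:/root:/bin/bash\n",
}

print(nodes_missing_user("exampleuser", passwd_by_node))  # ['node-b']
print(nodes_missing_user("root", passwd_by_node))         # []
```

Any node in the returned list would need the user created before jobs are submitted as that user.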

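I ran the hadoop jar command as a different user that exists on all cluster nodes and has a hdfs://user/hdfs directory (in this case, the Ambari-created hdfs user), and found that the h2o jar ran as expected. This worked even though Ambari created this user with different uids across nodes (IDK if this is a problem).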

I was initially under the impression that a user only needed to exist on whatever client machine was being used, plus have a hdfs://user/ directory (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005).

Side note:

One concerning / confusing thing is that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g. ...

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

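This does not seem like how it is supposed to be (this is just my suspicion from having worked with MapR, which requires uids and gids to be the same across nodes, and from looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 is a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with on all the nodes, with arbitrary uid and gid values. Any thoughts on this? Can you link me to any docs that support why this is right or wrong?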

Looking into this a bit more here:
