为什么 YARN 作业不转换到运行状态？

Question

我有很多我想要的 Samza 工作运行。我能拿到第一个运行 ok。但是，第二个工作似乎处于 ACCEPTED 状态，并且在我杀死第一个工作之前永远不会转换到运行状态。

这是来自 YARN 的视图 UI:

这是第二个作业的详细信息，您可以在其中看到没有分配任何节点：

我有 2 个数据节点，所以我应该能够运行多个作业。这是我的 yarn-site.xml 的相关部分（我在文件中唯一的其他配置是与 HA 配置、Zookeeper 等有关）：

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
</property>

编辑：

我可以在资源管理器日志中看到：

2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1446300861747_0018_000001 container=Container: [ContainerId: container_1446300861747_0018_01_000002, NodeId: yarndata-01:41274, NodeHttpAddress: yarndata-01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.125, absoluteUsedCapacity=0.125, numApps=1, numContainers=1 clusterResource=<memory:8192, vCores:8>
2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:2>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=2
2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:2> cluster=<memory:8192, vCores:8>
2015-11-01 17:47:37,658 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : yarndata-01:41274 for container : container_1446300861747_0018_01_000002
2015-11-01 17:47:37,659 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1446300861747_0018_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2015-11-01 17:47:39,154 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1446300861747_0018_01_000002 Container Transitioned from ACQUIRED to RUNNING
2015-11-01 17:48:03,821 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 19
2015-11-01 17:48:04,339 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 19 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2015-11-01 17:48:04,339 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 19 submitted by user www-data
2015-11-01 17:48:04,339 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=www-data IP=192.168.2.81 OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1446300861747_0019
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1446300861747_0019
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from NEW to NEW_SAVING
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1446300861747_0019
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from NEW_SAVING to SUBMITTED
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application added - appId: application_1446300861747_0019 user: www-data leaf-queue of parent: root #applications: 2
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Accepted application application_1446300861747_0019 from user: www-data, in queue: default
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from SUBMITTED to ACCEPTED
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1446300861747_0019_000001
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1446300861747_0019_000001 State change from NEW to SUBMITTED
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1446300861747_0019 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@202c5cd5, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 1 #queue-pending-applications: 1 #queue-active-applications: 1

请问我哪里做错了？

Answer 1

答案在于资源管理器说没有足够的资源来创建新的 samza 容器和应用程序主机。

我将 capacity-scheduler.xml 中的 yarn.scheduler.capacity.maximum-am-resource-percent 的值更改为大于默认值 0.1。

此参数的 documentation 声明：

Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running applications.

为什么 YARN 作业不转换到运行状态？

Why does YARN job not transition to RUNNING state?

hadoop

hadoop-yarn

apache-samza

为什么 YARN 作业不转换到 运行 状态？

Why does YARN job not transition to RUNNING state?

hadoop

hadoop-yarn

apache-samza

为什么 YARN 作业不转换到运行状态？