YARN Java process not killed

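I have installed Apache Samza, which uses YARN to manage jobs. It is running on two Debian servers on virtual machines. Samza is version 0.9.1 and Hadoop is version 2.6.0. I am seeing two different problems. I am not sure whether they are related, but in both cases YARN does not seem to be doing what it should.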

yarn-site.xml:

<configuration>
<property>
 <name>yarn.resourcemanager.hostname</name>
 <value>kfk-samza01</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
</configuration>

I added the following to the job's properties file:

yarn.container.memory.mb=256
yarn.am.container.memory.mb=256

task.opts= -Xms128M -Xmx128M

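When the job runs, I can see that the -Xms128M -Xmx128M options are ignored and the default values are used.

I also see the following error. It looks as if some memory limit is preventing the job from moving from ACCEPTED to RUNNING, but I cannot find a way to fix it.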

Container [pid=23007,containerID=container_1443454508386_0003_01_000001] is running beyond virtual memory limits. Current usage: 13.9 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container
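The numbers in this message can be reproduced with a little arithmetic. The sketch below assumes the limit is the container's physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio (default 2.1), which matches the 537.6 MB reported for a 256 MB container. The function names are made up for illustration and are not part of YARN.

```python
# Rough model of the virtual-memory check described in the error message.
# Assumption: limit = physical allocation * yarn.nodemanager.vmem-pmem-ratio.

def vmem_limit_mb(container_mb: float, vmem_pmem_ratio: float = 2.1) -> float:
    """Virtual memory limit (MB) for a container of the given physical size."""
    return container_mb * vmem_pmem_ratio


def exceeds_vmem_limit(vmem_used_mb: float, container_mb: float,
                       vmem_pmem_ratio: float = 2.1) -> bool:
    """True if the observed virtual memory usage is above the limit."""
    return vmem_used_mb > vmem_limit_mb(container_mb, vmem_pmem_ratio)


if __name__ == "__main__":
    container_mb = 256          # yarn.am.container.memory.mb from the job config
    vmem_used_mb = 1.1 * 1024   # "1.1 GB" from the log (a rounded figure)
    print(f"limit: {vmem_limit_mb(container_mb):.1f} MB")
    print(f"over limit: {exceeds_vmem_limit(vmem_used_mb, container_mb)}")
```

So the container is killed for its virtual memory footprint, even though only 13.9 MB of physical memory is actually in use.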

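The job is really just a clean function, so none of my code should be introducing noise.

Any idea what the problem is?

Update: after staying in the ACCEPTED state for about 10 minutes, the job goes to FAILED. This is part of what I see in the yarn-root-resourcemanager-kfk-samza01.out log: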

2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Allocated Container     TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(153)) - Assigned container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-09-30 14:08:07,001 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:assignContainer(1580)) - assignedContainer application attempt=appattempt_1443613686881_0001_000002 container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 clusterResource=<memory:16384, vCores:16>
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainersToChildQueues(559)) - Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.0625, absoluteUsedCapacity=0.0625, numApps=1, numContainers=1
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainers(424)) - assignedContainer queue=root usedCapacity=0.0625 absoluteUsedCapacity=0.0625 used=<memory:1024, vCores:1> cluster=<memory:16384, vCores:16>
2015-09-30 14:08:07,005 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : kfk-samza01:44816 for container : container_1443613686881_0001_02_000001
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1830)) - Storing attempt: AppId: application_1443613686881_0001 AttemptId: appattempt_1443613686881_0001_000002 MasterContainer: Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ]
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2015-09-30 14:08:07,011 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2015-09-30 14:08:07,012 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:run(253)) - Launching masterappattempt_1443613686881_0001_000002
2015-09-30 14:08:07,018 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(106)) - Setting up container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,019 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:createAMContainerLaunchContext(191)) - Command to launch container container_1443613686881_0001_02_000001 : export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package/bin/run-am.sh 1>logs/stdout 2>logs/stderr
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,064 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(127)) - Done launching container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,065 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED
2015-09-30 14:08:08,001 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2015-09-30 14:21:26,930 INFO  [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1125)) - Updating application attempt appattempt_1443613686881_0001_000002 with final state: FAILED, and exit status: -1000
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(677)) - Unregistering app attempt : appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(1208)) - The number of failed attempts is 2. The max attempts is 2
2015-09-30 14:21:26,935 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(995)) - Updating application application_1443613686881_0001 with final state: FAILED
2015-09-30 14:21:26,937 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from ACCEPTED to FINAL_SAVING
2015-09-30 14:21:26,938 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(790)) - Application Attempt appattempt_1443613686881_0001_000002 is done. finalState=FAILED
2015-09-30 14:21:26,938 INFO  [AsyncDispatcher event handler] recovery.RMStateStore (RMStateStore.java:transition(161)) - Updating info for app: application_1443613686881_0001
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from RUNNING to KILLED
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(113)) - Completed container: container_1443613686881_0001_02_000001 in state: KILLED event:KILL
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Released Container      TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(216)) - Released container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-09-30 14:21:26,940 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(945)) - Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:releaseResource(1732)) - default used=<memory:0, vCores:0> numContainers=0 user=root user-resources=<memory:0, vCores:0>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:completedContainer(1683)) - completedContainer container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(604)) - completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(622)) - Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1274)) - Application attempt appattempt_1443613686881_0001_000002 released container container_1443613686881_0001_02_000001 on node: host: kfk-samza01:44816 #containers=0 available=8192 used=0 with event: KILL
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(115)) - Application application_1443613686881_0001 requests cleared
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(682)) - Application removed - appId: application_1443613686881_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-09-30 14:21:26,946 INFO  [pool-1-thread-4] amlauncher.AMLauncher (AMLauncher.java:run(267)) - Cleaning master appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,948 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,949 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:removeApplication(372)) - Application removed - appId: application_1443613686881_0001 user: root leaf-queue of parent: root #applications: 0
2015-09-30 14:21:26,951 WARN  [AsyncDispatcher event handler] resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(263)) - USER=root    OPERATION=Application Finished - Failed      TARGET=RMAppManager     RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.  APPID=application_1443613686881_0001
2015-09-30 14:21:26,955 INFO  [AsyncDispatcher event handler] resourcemanager.RMAppManager$ApplicationSummary (RMAppManager.java:logAppSummary(179)) - appId=application_1443613686881_0001,name=flow.Router_1,user=root,queue=default,state=FAILED,trackingUrl=http://kfk-samza01:8088/cluster/app/application_1443613686881_0001,appMasterHost=N/A,startTime=1443614243319,finishTime=1443615686935,finalStatus=FAILED
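A log like this is easier to read once it is reduced to its state transitions. The following is only a sketch: the regular expression is written against the line layout shown above (timestamp, then " - ", then an ID followed by "State change from X to Y"), and the helper name state_changes is made up for illustration. The sample lines are copied from the log above.

```python
import re

# Matches ResourceManager lines of the form
#   <timestamp> ... - <id> State change from <OLD> to <NEW>
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*? - "
    r"(?P<entity>\S+) State change from (?P<old>\w+) to (?P<new>\w+)"
)


def state_changes(lines):
    """Return (timestamp, id, old_state, new_state) for each matching line."""
    changes = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            changes.append(
                (match["ts"], match["entity"], match["old"], match["new"])
            )
    return changes


sample = [
    "2015-09-30 14:08:07,065 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED",
    "2015-09-30 14:21:26,930 INFO  [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs",
    "2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED",
]

if __name__ == "__main__":
    for ts, entity, old, new in state_changes(sample):
        print(f"{ts}  {entity}: {old} -> {new}")
```

Filtered this way, the attempt goes from LAUNCHED straight to FAILED after the "Timed out after 600 secs" line, with no transition showing the ApplicationMaster registering in between.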

Any clues?

Try the following job configuration properties to limit the container memory allocation.

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb

In your case, both of these properties can be set to 256 MB.

Also configure the following two properties:

mapreduce.map.java.opts
mapreduce.reduce.java.opts

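In your case, these two properties should be set to 128 MB (they take JVM options, for example -Xmx128m).

[Note: the two *.java.opts values above must be somewhat lower than their respective *.memory.mb properties.]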
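To make the note concrete, here is a small sketch that derives the JVM heap flags from a container size. The heap_fraction parameter and the function name are assumptions made for illustration. With the default of 0.5 it reproduces the 256 MB / 128 MB pairing suggested above, leaving the rest of the container for non-heap JVM memory.

```python
def jvm_heap_opts(container_mb: int, heap_fraction: float = 0.5) -> str:
    """Build -Xms/-Xmx flags that stay below the container's memory size.

    heap_fraction is a hypothetical tuning knob: the share of the container
    given to the Java heap. It must be strictly between 0 and 1 so the heap
    is always smaller than the container.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1 (exclusive)")
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    # Value one might put in a *.java.opts property for a 256 MB container.
    print(jvm_heap_opts(256))
```

The right fraction depends on how much off-heap memory the JVM needs (thread stacks, metaspace, direct buffers), so treat 0.5 as a starting point rather than a rule.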

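If you still hit the virtual memory problem, try adjusting the ratio of virtual to physical memory by configuring the following property:

yarn.nodemanager.vmem-pmem-ratio

The default value is 2.1. The virtual memory limit is the container's physical memory multiplied by this ratio (256 MB x 2.1 = 537.6 MB in the error above), so raise the ratio if the container is still being killed for exceeding its virtual memory limit.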
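As a rough guide to picking a value: the limit in the error message (537.6 MB) is the 256 MB container size multiplied by the default ratio of 2.1. The sketch below estimates the smallest ratio at which the observed usage would fit. The function name is made up, and the 1.1 GB figure comes from the log and is rounded, so the result is approximate.

```python
def min_vmem_pmem_ratio(vmem_used_mb: float, container_mb: float) -> float:
    """Smallest vmem-pmem ratio at which vmem_used_mb is within the limit.

    Assumes the limit is container_mb * ratio, as the numbers in the error
    message suggest.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    return vmem_used_mb / container_mb


if __name__ == "__main__":
    vmem_used_mb = 1.1 * 1024   # "1.1 GB" from the error message
    container_mb = 256          # container size from the job config
    ratio = min_vmem_pmem_ratio(vmem_used_mb, container_mb)
    print(f"ratio needed: at least {ratio:.1f} (default is 2.1)")
```

In practice one would round the result up and leave some margin, since a JVM's virtual memory usage can grow after startup.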

Once these properties are set correctly, the containers should be cleaned up after the job completes successfully.

Hope this helps.

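In the end I had two problems at the same time. The first, the memory limits, is solved and was kindly explained by hserus.

The other was a communication problem with the Kafka server that left a topic corrupted, so the job could not run.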