Hortonworks Hadoop NN and RM heap stuck overloaded at high utilization, but no applications running? (java.io.IOException: No space left on device)

Some Spark jobs launched recently from a Hadoop (HDP-3.1.0.0) client node have been throwing

Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

errors, and now I am seeing the NN and RM heaps apparently stuck at high utilization levels (e.g. 80-95%), even though no applications show as pending or running in the RM / YARN UI.

On the Ambari dashboard I see:

Yet in the RM UI, nothing appears to be running:

Looking at the recently failed Spark jobs, the error reported was...

[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 8.558 s
[2021-02-11 22:05:20,982] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: Job 7 finished: csv at NativeMethodAccessorImpl.java:0, took 8.561029 s
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Job null committed.
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Finished processing stats for job null.
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - writing to local FS staging area
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:262)
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at java.io.DataOutputStream.write(DataOutputStream.java:107)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:352)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:441)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:305)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:369)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - Caused by: java.io.IOException: No space left on device
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.writeBytes(Native Method)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.write(FileOutputStream.java:326)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:260)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     ... 29 more
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - Traceback (most recent call last):
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -   File "/home/airflow/projects/hph_etl_airflow/common_prep.py", line 112, in <module>
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -     assert get.returncode == 0, "ERROR: failed to copy to local dir"
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - AssertionError: ERROR: failed to copy to local dir
[2021-02-11 22:05:24,034] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Invoking stop() from shutdown hook
[2021-02-11 22:05:24,040] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO AbstractConnector: Stopped Spark@599cff94{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
[2021-02-11 22:05:24,048] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkUI: Stopped Spark web UI at http://airflowetl.ucera.local:4041
[2021-02-11 22:05:24,092] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Interrupting monitor thread
[2021-02-11 22:05:24,106] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Shutting down all executors
[2021-02-11 22:05:24,107] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - (serviceOption=None,
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  services=List(),
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  started=false)
[2021-02-11 22:05:24,115] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Stopped
[2021-02-11 22:05:24,123] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[2021-02-11 22:05:24,154] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MemoryStore: MemoryStore cleared
[2021-02-11 22:05:24,155] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManager: BlockManager stopped
[2021-02-11 22:05:24,157] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManagerMaster: BlockManagerMaster stopped
[2021-02-11 22:05:24,162] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[2021-02-11 22:05:24,173] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Successfully stopped SparkContext
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Shutdown hook called
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-57e1dfa3-26e8-490b-b7ca-94bce93e36d7
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506/pyspark-225760d8-f365-49fe-8333-6d0df3cb99bd
[2021-02-11 22:05:24,646] {bash_operator.py:132} INFO - Command exited with return code 1
[2021-02-11 22:05:24,663] {taskinstance.py:1088} ERROR - Bash command failed
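
For context on how the Java stack trace and the Python AssertionError fit together: the FsShell / CommandWithDestination / RawLocalFileSystem frames are what the hadoop fs shell executes when copying from HDFS down to the local filesystem, so the failing step in common_prep.py was presumably a subprocess wrapper around something like hadoop fs -get. A minimal sketch of what such a step might look like (the command and paths are assumptions for illustration, not the actual script):

```python
import subprocess

# Pull job output from HDFS down to the client node's local staging area.
# "hadoop fs -get" runs FsShell -> CommandWithDestination.copyFileToTarget,
# which writes through RawLocalFileSystem (the exact frames in the stack
# trace above) and dies with java.io.IOException: No space left on device
# when the *local* disk, not HDFS, is full.
get = subprocess.run(
    ["hadoop", "fs", "-get", "/hdfs/output/dir", "/local/staging/dir"]  # hypothetical paths
)
assert get.returncode == 0, "ERROR: failed to copy to local dir"
```

The important detail is that this copy lands on the local disk of the machine running the script, which is consistent with the "writing to local FS staging area" message logged just before the failure.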

Note: I can't do much more debugging on this, as the cluster has since been restarted via Ambari (some daily tasks depend on it, so it could not be left in that state), after which the NN and RM heap went down to 10% and 25% respectively.

Does anyone know what could be going on here? Anywhere else I could (still) check for further debugging info?

Running df -h and du -h -d1 /some/paths/of/interest on the machine making the Spark calls (just guessing from the "writing to local FS" and "No space left on device" error messages; running clush -ab df -h / across all the Hadoop nodes, I could see the client node launching the Spark jobs was the only one with high disk utilization), I found that only 1GB of disk space was left on the machine invoking the Spark jobs (due to other issues), which appears to be what ultimately threw this error for some of them. I have since fixed that, but am not sure whether it is related, as my understanding is that Spark does the actual processing on other nodes in the cluster.
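
Since the job stages data through the client node's local filesystem, one cheap guard against a repeat (just a sketch of the idea, not code from the actual job; the staging path and threshold are made up) is to check free space on the staging path before kicking off the copy:

```python
import shutil

STAGING_DIR = "/local/staging/dir"  # hypothetical local staging path
MIN_FREE_GB = 10                    # arbitrary safety threshold

# shutil.disk_usage reports total/used/free bytes for the filesystem
# containing the given path -- the same numbers df -h shows.
free_gb = shutil.disk_usage(STAGING_DIR).free / 2**30
if free_gb < MIN_FREE_GB:
    raise RuntimeError(
        f"only {free_gb:.1f} GB free on {STAGING_DIR}; "
        "refusing to stage job output locally"
    )
```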

I suspect this was the problem, but if someone with more experience could explain what was going wrong under the surface here, that would be very helpful for future debugging and would make for a better actual answer to this post. E.g.

  1. Why would the lack of free disk space on one of the cluster nodes (the client node, in this case) cause the RM heap to stay at such high utilization even when no jobs are reported as running in the RM UI?
  2. Why would the lack of disk space on the local machine affect the Spark jobs at all (my understanding being that Spark does the actual processing on other nodes in the cluster)?

If disk space on the local machine calling the spark jobs was indeed the problem, then this question can likely be marked as a duplicate of the question answered here: .