Obtaining the number of nodes, number of cores, and available RAM for tuning

Question

I am trying to tune my HPC cluster (I use Sparklyr), and I am trying to collect some of the important specs described in http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/:

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory.

That is:

number of nodes
number of cores
disk space and memory

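I know how to use sinfo -n -l, but the output lists too many cores and I cannot easily extract this information from it. Is there a simpler way to get the overall specs of my cluster?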

Ultimately, I am trying to find reasonable values for the parameters --num-executors, --executor-cores and --executor-memory.

Answer 1

Number of nodes:

sinfo -O "nodes" --noheader

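Number of cores: by default, Slurm's "cores" is the number of cores per socket, not the total number of cores available on a node. Somewhat confusingly, in Slurm cpus = cores * sockets (so a dual-processor, 6-core machine has 2 sockets, 6 cores and 12 cpus). When hyperthreading is enabled, Slurm may also count each hardware thread as a cpu.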

The number of cores (= cpus in Slurm), disk space and RAM are harder to obtain because they can differ between nodes. The following returns an easy-to-parse list:

sinfo -N -O "nodehost,disk,memory,cpus" --noheader
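If the nodes are not identical, the list above can be condensed into a few totals and minimums. The following is a minimal sketch, not part of the original answer: summarize_nodes is a hypothetical helper name, it assumes the four whitespace-separated columns appear in the order requested above (host, disk, memory, cpus), and it assumes a node that belongs to several partitions is listed more than once, so repeated hostnames are counted only once.

```shell
# Hypothetical helper: summarize the per-node list read from stdin.
# Assumed input columns: hostname, tmp disk (MB), memory (MB), cpus.
summarize_nodes() {
  awk '
    # Skip malformed lines and count each hostname only once.
    NF >= 4 && !seen[$1]++ {
      m = $3 + 0; c = $4 + 0          # force numeric values
      n++
      mem += m; cpus += c
      if (n == 1 || c < min_cpus) min_cpus = c
      if (n == 1 || m < min_mem)  min_mem  = m
    }
    END {
      printf "nodes=%d total_cpus=%d total_mem_mb=%d min_cpus=%d min_mem_mb=%d\n",
             n, cpus, mem, min_cpus, min_mem
    }'
}
```

On a login node this could be used as sinfo -N -O "nodehost,disk,memory,cpus" --noheader | summarize_nodes. The minimums are reported because executor sizes generally have to fit on the smallest node.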

If all nodes are identical, we can take the information from the first line of sinfo's output:

Cores per node (= Slurm cpus):

sinfo -N -O "cpus" --noheader | head -1

RAM per node (reported in MB):

sinfo -N -O "memory" --noheader | head -1

Disk space per node (temporary disk, reported in MB):

sinfo -N -O "disk" --noheader | head -1
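Once these three numbers are known, they can be turned into a first guess for the Spark parameters from the question. The sketch below is an assumption-laden illustration rather than a definitive recipe: spark_executor_args is a hypothetical helper, and it hard-codes the rules of thumb from the Cloudera post's worked example (reserve 1 core and 1 GB per node for the OS and daemons, use 5 cores per executor, subtract roughly 7% of executor memory for overhead, and give up one executor slot for the application master). Those rules were written for YARN, so they may need adjusting on a Slurm-managed cluster.

```shell
# Hypothetical helper: derive a starting point for Spark executor settings.
# Arguments: number of nodes, cpus per node, memory per node in GB.
spark_executor_args() {
  sea_nodes=$1; sea_cpus=$2; sea_mem_gb=$3
  sea_cores_per_exec=5                      # assumed rule of thumb
  sea_usable_cpus=$((sea_cpus - 1))         # leave 1 core for OS/daemons
  sea_usable_mem=$((sea_mem_gb - 1))        # leave 1 GB for OS/daemons
  sea_per_node=$((sea_usable_cpus / sea_cores_per_exec))
  if [ "$sea_per_node" -lt 1 ]; then
    echo "node too small for $sea_cores_per_exec cores per executor" >&2
    return 1
  fi
  # One executor slot is given up for the application master.
  sea_num_exec=$((sea_nodes * sea_per_node - 1))
  # Memory per executor minus ~7% overhead, rounded down to whole GB.
  sea_mem_per_exec=$(( (sea_usable_mem * 93) / (sea_per_node * 100) ))
  echo "--num-executors $sea_num_exec --executor-cores $sea_cores_per_exec --executor-memory ${sea_mem_per_exec}G"
}
```

For the cluster in the quoted example (6 nodes, 16 cores, 64 GB each), spark_executor_args 6 16 64 prints --num-executors 17 --executor-cores 5 --executor-memory 19G. Because sinfo reports memory in MB, the value needs converting first, for example with $((mem_mb / 1024)).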

slurm

apache-spark

sparklyr