slurmd ignores slurm config on startup

I don't understand why my configuration is being ignored, even when I point to it directly with -f. Google turns up nothing; is there any relevant documentation I could look at?

Hopefully I'm just missing some key piece of information.

After starting the slurmctld daemon on one machine, running sudo slurmd -f /usr/local/etc/slurm.conf -D -vvvvvvv (for testing) gives this output (relevant excerpt; note RealMemory = 3907):

slurmd: debug3: Confile     = `/usr/local/etc/slurm.conf'
slurmd: debug3: Debug       = 3
slurmd: debug3: CPUs        = 2  (CF:  2, HW:  2)
slurmd: debug3: Boards      = 1  (CF:  1, HW:  1)
slurmd: debug3: Sockets     = 2  (CF:  1, HW:  2)
slurmd: debug3: Cores       = 1  (CF:  2, HW:  1)
slurmd: debug3: Threads     = 1  (CF:  1, HW:  1)
slurmd: debug3: UpTime      = 8838 = 02:27:18
slurmd: debug3: Block Map   = 0,1
slurmd: debug3: Inverse Map = 0,1
slurmd: debug3: RealMemory  = 3907
slurmd: debug3: TmpDisk     = 19018
slurmd: debug3: Epilog      = `(null)'
slurmd: debug3: Logfile     = `/var/log/slurmd.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName    = node1
slurmd: debug3: Port        = 6818
slurmd: debug3: Prolog      = `(null)'
slurmd: debug3: TmpFS       = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd  = `/usr/local/sbin/slurmstepd'
slurmd: debug3: Spool Dir   = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug  = 10
slurmd: debug3: Pid File    = `/var/run/slurm/slurmd.pid'
slurmd: debug3: Slurm UID   = 64030
slurmd: debug3: TaskProlog  = `(null)'
slurmd: debug3: TaskEpilog  = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: UsePAM      = 0

while slurmctld spams:

slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug:  Node node1 has low real_memory size (3907 < 2000000)

slurm.conf, as output by cat /usr/local/etc/slurm.conf | grep -v "#" (note RealMemory=2000000, as well as other config details that are ignored):

ClusterName=scluster_0
SlurmctldHost=controller
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=node[1-2] CPUs=2 RealMemory=2000000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=pdefault Nodes=ALL Default=YES MaxTime=INFINITE State=UP

The configuration is identical on both systems (the one running the slurmctld daemon and the one running the slurmd daemon).

I also have a cgroup_allowed_devices.conf and a cgroup.conf, if those are relevant.

My guess: slurmd is reading the config file correctly. What happens is that Slurm cross-checks the configuration against the hardware it actually detects. According to the configuration the node should have 2000000 of RealMemory, but looking at the hardware it only finds 3907. This mismatch is reported and the node is drained.

This behavior makes sure that a DIMM in your server can't fail without you noticing.
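
If it helps to see this directly, slurmd can print the hardware configuration it actually detects with slurmd -C; the output below is only illustrative of what a node like node1 might report, not taken from the question:

sudo slurmd -C
NodeName=node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3907
UpTime=0-02:27:18

Comparing that RealMemory value with the NodeName line in slurm.conf shows the same mismatch the controller complains about.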

@Marcus Boden is correct.

The RealMemory = 3907 in the slurmd output is what Slurm found on the server, not what it read from the config.

It found 3907 MB of RAM, compared that with the 2000000 it found in the config file, and complained:

slurmctld: debug:  Node node1 has low real_memory size (3907 < 2000000)

So, basically, it found 4 GB of RAM where, according to the config, it expected to find 2 TB.
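
Since that mismatch drains the node, you can also confirm the state and the reason on the controller; a minimal sketch, assuming the node name from the question (the exact reason string may differ):

sinfo -R
scontrol show node node1

Look for a DRAIN state and a Reason mentioning the low RealMemory.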

You should check on the server exactly how much memory Linux sees, using the free command, and make sure it matches the specs you think the machine has.
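
A minimal sketch of that check; the numbers are illustrative. RealMemory in slurm.conf is given in megabytes, so it should be no larger than the total reported by free -m (keeping it a bit below the detected total is common practice, not something required by the question):

free -m
              total        used        free      shared  buff/cache   available
Mem:           3907          612         210          12        3084        3065

NodeName=node[1-2] CPUs=2 RealMemory=3900 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN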

For more information, see for example here.