slurmd ignores slurm config on startup
I don't understand why my config gets ignored, even when I specify it directly with -f. Google turns up nothing; is there any relevant documentation I could look at? Hopefully I'm just missing some crucial piece of information.
After starting the slurmctld daemon on one machine, running sudo slurmd -f /usr/local/etc/slurm.conf -D -vvvvvvv (for testing) gives the following output (relevant excerpt; note RealMemory = 3907):
slurmd: debug3: Confile = `/usr/local/etc/slurm.conf'
slurmd: debug3: Debug = 3
slurmd: debug3: CPUs = 2 (CF: 2, HW: 2)
slurmd: debug3: Boards = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets = 2 (CF: 1, HW: 2)
slurmd: debug3: Cores = 1 (CF: 2, HW: 1)
slurmd: debug3: Threads = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime = 8838 = 02:27:18
slurmd: debug3: Block Map = 0,1
slurmd: debug3: Inverse Map = 0,1
slurmd: debug3: RealMemory = 3907
slurmd: debug3: TmpDisk = 19018
slurmd: debug3: Epilog = `(null)'
slurmd: debug3: Logfile = `/var/log/slurmd.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName = node1
slurmd: debug3: Port = 6818
slurmd: debug3: Prolog = `(null)'
slurmd: debug3: TmpFS = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd = `/usr/local/sbin/slurmstepd'
slurmd: debug3: Spool Dir = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug = 10
slurmd: debug3: Pid File = `/var/run/slurm/slurmd.pid'
slurmd: debug3: Slurm UID = 64030
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: UsePAM = 0
The ctld spams:
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
slurm.conf
Output of cat /usr/local/etc/slurm.conf | grep -v "#" (note RealMemory=2000000, along with the other config details that get ignored):
ClusterName=scluster_0
SlurmctldHost=controller
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=node[1-2] CPUs=2 RealMemory=2000000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=pdefault Nodes=ALL Default=YES MaxTime=INFINITE State=UP
The config is identical on both systems (the one running the slurmctld daemon and the one running the slurmd daemon).
I also have a cgroup_allowed_devices.conf and a cgroup.conf, in case those are relevant.
My guess is: slurmd is reading the config file correctly. What happens is that Slurm cross-checks the config against the hardware it actually detects. From the config it expects the node to have RealMemory=2000000, but when it inspects the hardware it only finds 3907. That mismatch is reported and the node is drained.
This behavior makes sure that a DIMM in your server cannot fail without you noticing.
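To see exactly what slurmd detects on a node, you can dump the detected hardware in slurm.conf format and compare it with the configured line; the commands below are only a sketch (node name and config path taken from the question):
# print the node configuration slurmd detects on this host
# (CPUs, Boards, Sockets, Cores, Threads, RealMemory, TmpDisk)
sudo slurmd -C
# compare with what slurm.conf claims for this node
grep '^NodeName' /usr/local/etc/slurm.conf
# after lowering RealMemory in slurm.conf to (at most) the detected value
# and restarting slurmd, bring the drained node back:
sudo scontrol update NodeName=node1 State=RESUME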
@Marcus Boden is correct.
The RealMemory = 3907 in the slurmd output is what Slurm detects on the server, not what it reads from the config file.
It finds 3907 MB of RAM, compares that to the 2000000 it found in the config file, and complains:
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
So, basically, it found 4 GB of RAM where, according to the config, it expected to find 2 TB.
You should check on the server exactly how much memory Linux sees, using the free command, and make sure it matches the spec you think the machine has.
See here for more information, for example.
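For a quick check of how much memory Linux actually sees (Slurm's RealMemory is expressed in megabytes), something along these lines is enough; the exact numbers will of course differ on your machine:
# total memory in MB, roughly what slurmd reports as RealMemory
free -m
# same information straight from the kernel (value in kB)
grep MemTotal /proc/meminfo
If free really shows about 2 TB, the node definition is fine and something else is off; with roughly 4 GB reported, RealMemory=2000000 in slurm.conf is simply too high for this node.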