error: _slurm_rpc_node_registration node=xxxxx: Invalid argument
I am trying to set up Slurm. I have just one login node (named ctm-login-01) and one compute node (named ctm-deep-01). The compute node has several CPUs and 3 GPUs.
My compute node is permanently stuck in the drain state, and I cannot for the life of me figure out where to even start...
Login node
sinfo
ctm-login-01:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain ctm-deep-01
What is the reason?
sinfo -R
ctm-login-01:~$ sinfo -R
REASON USER TIMESTAMP NODELIST
gres/gpu count repor slurm 2020-12-11T15:56:55 ctm-deep-01
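(An aside: sinfo -R truncates the REASON column. Running scontrol show node ctm-deep-01 from the login node prints the full, untruncated Reason= string, which is usually the quickest way to read the whole message.)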
Indeed, I keep getting these error messages in /var/log/slurm-llnl/slurmctld.log:
/var/log/slurm-llnl/slurmctld.log
[2020-12-11T16:17:39.857] gres/gpu: state for ctm-deep-01
[2020-12-11T16:17:39.857] gres_cnt found:0 configured:3 avail:3 alloc:0
[2020-12-11T16:17:39.857] gres_bit_alloc:NULL
[2020-12-11T16:17:39.857] gres_used:(null)
[2020-12-11T16:17:39.857] error: _slurm_rpc_node_registration node=ctm-deep-01: Invalid argument
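(Reading that log: found:0 is the gres/gpu count the node itself reported when registering, while configured:3 is what slurm.conf declares for it. As far as I understand, it is that mismatch which makes slurmctld reject the registration and drain the node.)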
(Note that I have set the slurm.conf debug level to verbose, and have also set DebugFlags=Gres to get more detail about the GPUs.)
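For reference, those debug settings amount to lines like the following in slurm.conf (SlurmctldDebug, SlurmdDebug and DebugFlags are standard Slurm options; which of the two debug levels matters here is an assumption on my part, so I set both):
SlurmctldDebug=verbose
SlurmdDebug=verbose
DebugFlags=Gres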
These are the configuration files I have on all nodes, with part of their contents...
/etc/slurm-llnl/* files
ctm-login-01:/etc/slurm-llnl$ ls
cgroup.conf cgroup_allowed_devices_file.conf gres.conf plugstack.conf plugstack.conf.d slurm.conf
ctm-login-01:/etc/slurm-llnl$ tail slurm.conf
#SuspendTime=
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=ctm-deep-01 Gres=gpu:3 CPUs=24 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ctm-deep-01 Default=YES MaxTime=INFINITE State=UP
# default
SallocDefaultCommand="srun --gres=gpu:1 $SHELL"
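Another check worth showing: slurmd -C, run on the compute node, prints the hardware slurmd actually detects in slurm.conf syntax, so it can be compared against the NodeName line above. On this machine it should report something along these lines (output format may vary slightly by Slurm version):
ctm-deep-01:~$ slurmd -C
NodeName=ctm-deep-01 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=128754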
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
ctm-login-01:/etc/slurm-llnl$ cat cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
ctm-login-01:/etc/slurm-llnl$ cat cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*
Compute node
The logs on my compute node are as follows.
/var/log/slurm-llnl/slurmd.log
ctm-deep-01:~$ sudo tail /var/log/slurm-llnl/slurmd.log
[2020-12-11T15:54:35.787] Munge credential signature plugin unloaded
[2020-12-11T15:54:35.788] Slurmd shutdown completing
[2020-12-11T15:55:53.433] Message aggregation disabled
[2020-12-11T15:55:53.436] topology NONE plugin loaded
[2020-12-11T15:55:53.436] route default plugin loaded
[2020-12-11T15:55:53.440] task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffff
[2020-12-11T15:55:53.440] Munge credential signature plugin loaded
[2020-12-11T15:55:53.441] slurmd version 19.05.5 started
[2020-12-11T15:55:53.442] slurmd started on Fri, 11 Dec 2020 15:55:53 +0000
[2020-12-11T15:55:53.443] CPUs=24 Boards=1 Sockets=1 Cores=12 Threads=2 Memory=128754 TmpDisk=936355 Uptime=26 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
That CPU affinity mask looks strange... (although, counting it out, the trailing ffffff is 24 set bits, one per CPU, so it is probably fine).
Note that I have already run sudo nvidia-smi --persistence-mode=1. Also note that the gres.conf file above seems correct, judging by the topology:
nvidia-smi topo -m
ctm-deep-01:/etc/slurm-llnl$ sudo nvidia-smi topo -m
GPU0 GPU1 GPU2 CPU Affinity NUMA Affinity
GPU0 X SYS SYS 0-23 N/A
GPU1 SYS X PHB 0-23 N/A
GPU2 SYS PHB X 0-23 N/A
Any other logs or configuration files I should be mining for clues? Thanks!
It was all a typo!
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
Obviously, that should be NodeName=ctm-deep-01, which is my compute node! Because every line named the login node instead, slurmd on ctm-deep-01 found no GPU entries for itself and registered a gres/gpu count of 0, while slurmctld expected 3; hence the found:0 configured:3 mismatch and the drain. Oops...
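For completeness, the corrected gres.conf on the compute node (only the node name changes; the device files and CPU ranges stay as they were):
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
After fixing it, restarting the daemons and resuming the node should clear the drain state (assuming the systemd units that the Debian/Ubuntu slurm-llnl packages install):
ctm-deep-01:~$ sudo systemctl restart slurmd
ctm-login-01:~$ sudo systemctl restart slurmctld
ctm-login-01:~$ sudo scontrol update NodeName=ctm-deep-01 State=RESUME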