gpucompute* is down* in slurm cluster
My gpucompute node is in the down state and I can't submit jobs to the GPU nodes. After following all the solutions I could find online, I still can't return my 'down' GPU node to service. Before this issue I had an error in the Nvidia driver configuration where the GPUs could not be detected through 'nvidia-smi'; after fixing that by running 'NVIDIA-Linux-x86_64-410.79.run --no-drm', I ran into this error with the node stuck in the down state. Any help is much appreciated!
command: sbatch md1.s
sbatch: error: Batch job submission failed: Requested node configuration is not available
command: sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 down* fwb-lab-tesla1
command: sinfo -R
REASON USER TIMESTAMP NODELIST
Not responding slurm 2020-09-25T13:13:19 fwb-lab-tesla1
command: sinfo -Nl
Fri Sep 25 16:35:25 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* down* 32 32:1:1 64000 0 1 (null) Not responding
command: vim /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
command: ls /etc/init.d
functions livesys livesys-late netconsole network README
command: nvidia-smi
Fri Sep 25 16:35:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:02:00.0 Off | N/A |
| 24% 32C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V Off | 00000000:03:00.0 Off | N/A |
| 23% 35C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN V Off | 00000000:83:00.0 Off | N/A |
| 30% 44C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V Off | 00000000:84:00.0 Off | N/A |
| 31% 42C P8 N/A / N/A | 0MiB / 12036MiB | 6% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The problem you mention was probably preventing the slurmd daemon on gpucompute from starting. You should be able to confirm that by running systemctl status slurmd, or the equivalent command for your Linux distribution. The slurmd log probably contains a line similar to:
slurmd[1234]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
Try restarting it with
systemctl start slurmd
once you have made sure that nvidia-smi responds correctly.
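For reference, that fatal error usually means the GPU device files referenced in gres.conf were missing when slurmd started, which matches the situation before the driver was reinstalled. A minimal gres.conf for a node with four GPUs could look like the sketch below; the exact contents are an assumption for illustration and were not taken from this cluster:

# /etc/slurm/gres.conf -- hypothetical sketch for a 4-GPU node
# NodeName must match the node definition in slurm.conf; the device paths
# assume the standard /dev/nvidia0 .. /dev/nvidia3 files created by the driver.
NodeName=fwb-lab-tesla1 Name=gpu File=/dev/nvidia[0-3]

If the /dev/nvidia* files are missing, running nvidia-smi once (or loading the driver at boot) normally recreates them, which is why checking nvidia-smi before starting slurmd matters here.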
My issue was resolved by following the instructions below. Keep in mind that whenever you reboot the system, you need to enter these commands again after the reboot (a small wrapper script for this is sketched after the steps). Thanks to Joan Bryan for resolving this!
slurmd -Dcvvv
reboot
ps -ef | grep slurm
kill xxxx (xxxx is the process ID shown in the output of the previous ps -ef command)
nvidia-smi
systemctl start slurmctld
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
Now you can run jobs on the GPU nodes!
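Since these steps have to be repeated after every reboot, they can be wrapped in a small shell script. The sketch below is only a convenience wrapper around the commands above, under the assumption that it is run as root and that slurmctld and slurmd both run on this machine, as the steps suggest; the node name is the one from this cluster.

#!/bin/bash
# post_reboot_recovery.sh -- hypothetical wrapper around the manual steps above
set -e

# Stop any stale Slurm daemons left over from before the reboot
# (replaces the manual ps -ef | grep slurm / kill step).
pkill slurmctld || true
pkill slurmd    || true

# Make sure the Nvidia driver answers and the GPU device files exist.
nvidia-smi

# Bring the Slurm daemons back up.
systemctl start slurmctld
systemctl start slurmd

# Return the GPU node to service.
scontrol update nodename=fwb-lab-tesla1 state=idle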
Cheers