slurmd.service fails and there is no PID file /var/run/slurmd.pid
I am trying to start slurmd.service with the commands below, but it never stays up for good. I would really appreciate any help with this!
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
Here is the slurmd.service unit file:
cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
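With Type=forking, systemd waits for the daemon to write the file named in PIDFile= before it considers the start complete; if slurmd never finishes starting, that file never appears and the start operation eventually times out. As a quick sanity check (a sketch, exact output formatting may differ), both sides should point at the same path:
$ grep PIDFile /usr/lib/systemd/system/slurmd.service    # path systemd waits for
$ scontrol show config | grep -i SlurmdPidFile           # path slurmd writes, taken from slurm.conf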
Here is the status of the node:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 drain fwb-lab-tesla1
$ sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory root 2020-09-28T16:46:28 fwb-lab-tesla1
$ sinfo -Nl
Thu Oct 1 14:00:10 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* drained 32 32:1:1 64000 0 1 (null) Low RealMemory
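The drain reason "Low RealMemory" normally means the memory the node actually reports is lower than the RealMemory value configured in slurm.conf. As a rough cross-check (a sketch; the values printed depend on the node itself):
$ slurmd -C    # prints the detected CPUs/RealMemory as a NodeName line; compare with RealMemory=64000 in slurm.conf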
Here are the contents of slurm.conf:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
There is no slurmd.pid in the path below. It appears once right after booting the system, but disappears again after a few minutes.
$ ls /var/run/
abrt cryptsetup gdm lvm openvpn-server slurmctld.pid tuned
alsactl.pid cups gssproxy.pid lvmetad.pid plymouth sm-notify.pid udev
atd.pid dbus gssproxy.sock mariadb ppp spice-vdagentd user
auditd.pid dhclient-eno2.pid httpd mdadm rpcbind sshd.pid utmp
avahi-daemon dhclient.pid initramfs media rpcbind.sock sudo vpnc
certmonger dmeventd-client ipmievd.pid mount samba svnserve xl2tpd
chrony dmeventd-server lightdm munge screen sysconfig xrdp
console ebtables.lock lock netreport sepermit syslogd.pid xtables.lock
crond.pid faillock log NetworkManager setrans systemd
cron.reboot firewalld lsm openvpn-client setroubleshoot tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
Main PID: 1492 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─1492 /usr/sbin/slurmctld
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.
I try to start slurmd.service, but it fails again after a few minutes:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
CGroup: /system.slice/slurmd.service
└─2986 /usr/sbin/slurmd
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
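If the status output alone is not informative enough, running the daemon in the foreground with extra verbosity usually prints the real startup error directly (a sketch; stop it with Ctrl-C before trying systemctl again):
$ sudo slurmd -D -vvv    # -D: do not daemonize, -vvv: very verbose logging to the terminal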
Here is the slurmd log output from the start attempt:
[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use
The log file says that slurmd cannot bind to the standard slurmd port 6818 because something is already using that address.
Do you have another slurmd already running on this node? Or is something else listening there? Try netstat -tulpen | grep 6818
to see what is using that address.
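For example, something along these lines should reveal and clear the stale listener (a sketch; the PID to kill is whatever the netstat/ss output actually shows, quite possibly the leftover /usr/sbin/slurmd process still listed under the unit's CGroup above):
$ sudo netstat -tulpen | grep 6818    # or: sudo ss -tlnp | grep 6818
$ sudo kill <PID-from-that-output>    # placeholder PID, taken from the command above
$ sudo systemctl restart slurmd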