slurmd.service is Failed & there is no PID file /var/run/slurmd.pid

I am trying to start slurmd.service with the commands below, but I cannot get it to stay up permanently. I would appreciate it if you could help me solve this problem!

systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle

Here is the slurmd.service unit file:

$ cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity


[Install]
WantedBy=multi-user.target
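
Note that with Type=forking, systemd only considers the start successful once the file named in PIDFile= appears. It is worth confirming that this path matches SlurmdPidFile in slurm.conf (both are /var/run/slurmd.pid here, so a mismatch is not the cause, but the check is cheap):

$ grep -i pidfile /usr/lib/systemd/system/slurmd.service /etc/slurm/slurm.conf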

Here is the state of the node:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpucompute*    up   infinite      1  drain fwb-lab-tesla1

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       root      2020-09-28T16:46:28 fwb-lab-tesla1

$ sinfo -Nl
Thu Oct  1 14:00:10 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*     drained   32   32:1:1  64000        0      1   (null) Low RealMemory  
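
The drain reason "Low RealMemory" usually means the node registered with less memory than the RealMemory=64000 configured in slurm.conf. A quick way to compare, assuming slurmd is on the PATH, is to print the hardware slurmd actually detects, then resume the node once the values agree:

$ slurmd -C    # prints the NodeName line (CPUs, RealMemory, ...) as detected on this node
$ free -m      # total memory in MiB, to compare against RealMemory
$ scontrol update nodename=fwb-lab-tesla1 state=resume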

Here is the content of slurm.conf:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment 
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
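
With FastSchedule=1, Slurm schedules against these configured values and drains any node that registers with less memory than its RealMemory. If slurmd -C reports slightly less than 64000 MB, one common fix is to lower RealMemory to at most the detected value (63000 below is purely illustrative) and reread the configuration:

NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=63000 CPUs=32 State=UNKNOWN

$ scontrol reconfigure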

There is no slurmd.pid in the path below. It appears once right after booting the system, but disappears again a few minutes later.

$ ls /var/run/
abrt          cryptsetup         gdm            lvm             openvpn-server  slurmctld.pid   tuned
alsactl.pid   cups               gssproxy.pid   lvmetad.pid     plymouth        sm-notify.pid   udev
atd.pid       dbus               gssproxy.sock  mariadb         ppp             spice-vdagentd  user
auditd.pid    dhclient-eno2.pid  httpd          mdadm           rpcbind         sshd.pid        utmp
avahi-daemon  dhclient.pid       initramfs      media           rpcbind.sock    sudo            vpnc
certmonger    dmeventd-client    ipmievd.pid    mount           samba           svnserve        xl2tpd
chrony        dmeventd-server    lightdm        munge           screen          sysconfig       xrdp
console       ebtables.lock      lock           netreport       sepermit        syslogd.pid     xtables.lock
crond.pid     faillock           log            NetworkManager  setrans         systemd
cron.reboot   firewalld          lsm            openvpn-client  setroubleshoot  tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
 Main PID: 1492 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─1492 /usr/sbin/slurmctld

Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.

I try to start slurmd.service, but it returns to the failed state again a few minutes later:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
  Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/slurmd.service
           └─2986 /usr/sbin/slurmd

Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
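
When a forking unit times out like this, running the daemon in the foreground usually surfaces the underlying error right away. A minimal check, using slurmd's standard options:

$ slurmd -D -vvv    # -D: stay in the foreground, -vvv: verbose logging to stderr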

Log output from starting slurmd:

[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use

The log file states that slurmd cannot bind to the standard slurmd port 6818 because something else is already using that address.

Do you have a second slurmd already running on this node, or is something else listening there? Try netstat -tulpen | grep 6818 to see what is using that address.
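
A likely suspect is already visible in the systemctl status output above: an old slurmd process (PID 2986) is still attached to the unit's cgroup and is presumably the one holding port 6818. A sketch of the cleanup, assuming netstat confirms that PID:

$ netstat -tulpen | grep 6818    # identify the process bound to the slurmd port
$ kill 2986                      # stop the stale slurmd (PID taken from the status output above)
$ systemctl start slurmd         # the new daemon can now bind the port and write its PID file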