How do I find the processes that are related to an sbatch job?

When I start a job with sbatch on a multi-node system, some processes are started on the nodes involved.

How can I find out which processes (process IDs) are running on those nodes as a result of the sbatch run?

I checked the Slurm documentation but did not find any command (e.g. scontrol, sstat) that shows the related processes.

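The idea is to find the process IDs and then use Linux tools to debug a process that is 'stuck' (i.e. producing no output), and perhaps figure out what that particular process is doing.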

You're looking for scontrol listpids. From the scontrol manpage:

listpids [job_id[.step_id]] [NodeName]

Print a listing of the process IDs in a job step (if JOBID.STEPID is provided), or all of the job steps in a job (if job_id is provided), or all of the job steps in all of the jobs on the local node (if job_id is not provided or job_id is "*"). This will work only with processes on the node on which scontrol is run, and only for those processes spawned by Slurm and their descendants. Note that some Slurm configurations (ProctrackType value of pgid) are unable to identify all processes associated with a job or job step. Note that the NodeName option is only really useful when you have multiple slurmd daemons running on the same host machine. Multiple slurmd daemons on one host are, in general, only used by Slurm developers.

Just SSH to the compute node and run scontrol listpids. It will print a table mapping PIDs to job IDs.

[root@node003 ~]# scontrol listpids | column -t
PID     JOBID     STEPID      LOCALID  GLOBALID
269852  68706234  batch       0        0
269998  68706234  batch       -        -
[etc.]

I'm using the column command here to align the columns and make the output easier to read.
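To go from that table to the kind of debugging described in the question, something along these lines might help. It is only a minimal sketch: the function names `pids_for_job` and `inspect_job` are made up for illustration, it assumes the five-column output format shown above, and it assumes a Linux node with the procps version of `ps`.

```shell
# Hypothetical helper: read `scontrol listpids` output on stdin and print
# the PIDs that belong to the given job ID, one per line.
# Lines whose first field is not numeric (the header, "[etc.]") are skipped.
pids_for_job() {
    awk -v job="$1" '$1 ~ /^[0-9]+$/ && $2 == job { print $1 }'
}

# Hypothetical helper: show state, kernel wait channel, elapsed time and
# command line for every process of a job. Must be run on the compute node.
inspect_job() {
    pids=$(scontrol listpids "$1" | pids_for_job "$1" | paste -sd, -)
    if [ -n "$pids" ]; then
        # Assumes procps ps, which accepts a comma-separated PID list for -p.
        ps -o pid,stat,wchan:20,etime,args -p "$pids"
    fi
}
```

On the node, `inspect_job 68706234` would then list the job's processes. A process that stays in state `D` or `S` with the same wait channel for a long time is a reasonable first candidate to look at more closely with tools such as strace or gdb.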