Adding Job Array elements in Slurm after submission

Question

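I am trying to run LS-Dyna (a finite element simulation program with a limited number of licenses available on my cluster) on a cluster operated by Slurm. I am trying to write my batch script with job arrays so that I don't waste processing time because of this license limit (and so that the output of the 'squeue' command is easier to read), but I can't get it to work.

I want to run the same Bash script on various FEM meshes, each of which I have organized into its own subfolder.

Considering this folder structure on my cluster...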

cluster root
|
...
|
|-+ my scratch space's root
  |
  |-+ this project
    |
    |--+ lat_-5mm
    |  |- runCurrentLine.bash
    |  |- other files
    |
    |--+ lat_-4.75mm
    |  |- runCurrentLine.bash
    |  |- other files
    |
    |--+ lat_-4.5mm
    |  |- runCurrentLine.bash
    |  |- other files
    |
   ...
    |
    |--+ lat_5mm
    |  |- runCurrentLine.bash
    |  |- other files
    |
    |
    |-sendDynaRuns.bash
    |-other dependencies

...I am trying to submit "runCurrentLine.bash" in each folder by running the following script on my login node.

#!/bin/bash
iter=0
for foldernow in */; do

# change to subdirectory for current line iteration
    cd "./${foldernow}";

# make Slurm and user happy
    echo "sending LS Dyna simulation for ${pos}mm line..."
    sleep 1

# first line only: send batch, and get job ID
    if [ "${iter}" == 0 ];then

# send the batch...
        jobID=$(sbatch -J "Dyna" --array="${iter}"%15 runCurrentLine.bash)

# ...ensure that Slurm's output shows on console (which includes the job ID)...
        echo "${jobID}"

# ...and extract the job ID and save as a variable
        jobID=$(echo "${jobID}" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')

# subsequent lines: add current line to job array
    else
        scontrol update --jobid="${jobID}" --array="${iter}"%15 runCurrentLine.bash
    fi

# prepare to move onto next position
    iter=$((iter+1))
    cd ../
done

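This setup correctly sends the batch job for the first line, at -0.25mm*. From the second line onwards, however, it does not seem to do the same. This is what I end up getting on my console:

*: I intended the "lat_xmm" folders to be processed in numerical order, but Unix does not seem to recognize that.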

$ ./sendDynaRuns.bash
sending LS Dyna simulation for -0.25mm line...
Submitted batch job 1081040
sending LS Dyna simulation for 0.25mm line...
sbatch: error: Batch job submission failed: Invalid job id specified
sending LS Dyna simulation for -0.5mm line...
sbatch: error: Batch job submission failed: Invalid job id specified

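I know that runCurrentLine.bash runs fine if I submit it manually with sbatch (and it finishes within the time limit I specify in the file, mostly because it does not have to compete with other lines for open licenses). What should I do to make my code work?

Thanks in advance!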

Answer 1

As @Poshi mentioned, you cannot add jobs to an existing array.

I would create a submission script like this:

#!/bin/bash
#SBATCH --array=0-<nb of folders minus 1>%15
# ALL OTHER SLURM SBATCH DIRECTIVES HERE

# Bash arrays are zero-indexed, so the array task IDs start at 0
folders=(lat_*)
foldernow=${folders[$SLURM_ARRAY_TASK_ID]}

cd "$foldernow" && ./runCurrentLine.bash

The only downside is that you need to set the array range explicitly based on the number of folders.
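
One possible way around that downside is to compute the range at submission time and pass it on the sbatch command line, where it takes precedence over the #SBATCH directive in the script. The following is only a sketch under some assumptions: array_spec is a hypothetical helper name, submit.bash is an assumed file name for the submission script, and the job script is assumed to index the folders array with zero-based task IDs. It prints the sbatch command rather than running it.

```shell
#!/bin/bash
# Sketch: derive the --array range from the number of lat_* folders.
# array_spec is a hypothetical helper; submit.bash is an assumed file name.

array_spec() {
    # $1 = number of folders, $2 = maximum number of tasks running at once
    local count=$1 limit=$2
    if [ "$count" -lt 1 ]; then
        echo "no folders found" >&2
        return 1
    fi
    # Zero-based range, to match Bash array indexing in the job script
    echo "0-$((count - 1))%${limit}"
}

# With nullglob, an unmatched pattern expands to nothing instead of itself
shopt -s nullglob
folders=(lat_*)   # same glob as in the submission script

if spec=$(array_spec "${#folders[@]}" 15); then
    # Dry run: print the command; remove "echo" to actually submit
    echo sbatch --array="${spec}" submit.bash
fi
```

Run from the project folder, this prints the command that would be submitted, so the range can be checked before removing the echo.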
