Missing iterations in loop under SLURM

Question

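I have a simple script that loops over a file and does some simple calculations. The code below is an excerpt from a much larger script: don't look for any usefulness in it, it is only a minimal example of the problem.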

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --workdir=.
#SBATCH --time=0:5:0
#SBATCH --partition=main
#SBATCH --qos=lowprio
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --requeue

for SM in MB BL
do
    while read -r id
    do
        srun --job-name "Test-${id}" --nodes 1 --ntasks 1 --cpus-per-task 1 ls "$id" 1>&2
        echo "${id}"
    done < <(grep "$SM" internal.txt | awk '{print $1 "_" $2 "_" $3 ".txt"}') > "test_${SM}.dat"
done

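The idea behind the code is this: a file named internal.txt holds a list of data that I need to split into two groups, named MB and BL. I use grep to select each group, I use awk to build the base name of each file, and I feed that name to the while loop as id. Inside the loop I launch a command with srun (ls in this example) and then simply print $id.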

The file internal.txt contains:

file 1 BL
file 1 MB
file 2 BL
file 2 MB
file 3 MB

So the expected output is two files, test_BL.dat:

file_1_BL.txt
file_2_BL.txt

and test_MB.dat:

file_1_MB.txt
file_2_MB.txt
file_3_MB.txt

What I actually get is both files, each containing only its first line. test_BL.dat:

file_1_BL.txt

and test_MB.dat:

file_1_MB.txt

I already know that srun is involved in the problem, because if I remove srun and keep only ls, everything works as expected:

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --workdir=.
#SBATCH --time=0:5:0
#SBATCH --partition=main
#SBATCH --qos=lowprio
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --requeue

for SM in MB BL
do
    while read -r id
    do
        ls "$id" 1>&2
        echo "${id}"
    done < <(grep "$SM" internal.txt | awk '{print $1 "_" $2 "_" $3 ".txt"}') > "test_${SM}.dat"
done

This last version works fine, but now I have lost srun. Any ideas about what is happening here?

Note: the listed files do exist.

Answer 1

Thanks to @Inian, the problem is solved!

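The key point is that srun, by default, reads its standard input in order to broadcast it to the tasks it launches. It does not wait for a task to start reading; it simply reads its input and keeps it in a buffer until something reads it or the process finishes (at which point the data is discarded). Inside the while loop, srun inherits the loop's standard input, so it consumes the remaining lines that read was supposed to receive, and the loop ends after the first iteration.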
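The effect can be reproduced without SLURM. The sketch below is only an illustration: greedy is a hypothetical stand-in for srun that drains its standard input with cat, and the loop is fed through a pipe rather than the process substitution used in the question (the mechanism is the same).

```shell
#!/bin/bash
# Hypothetical stand-in for srun: it reads everything available on its
# standard input and throws it away.
greedy() {
    cat > /dev/null
}

# The loop reads one line, then greedy inherits the same stdin and
# drains the rest, so read hits end-of-file on the second pass.
seen=$(
    printf '%s\n' first second third | while read -r id; do
        greedy
        echo "$id"
    done
)

printf 'lines seen by the loop: %s\n' "$seen"
```

Only the first of the three lines reaches the loop body, which matches the truncated .dat files in the question.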

To fix the immediate problem, we only need to close srun's standard input. The simplest way is to use the --input option and set it to none:

srun --input none --job-name "Test-${id}" --nodes 1 --ntasks 1 --cpus-per-task 1 ls "$id" 1>&2

Closing standard input with bash itself (i.e. appending <&-) or redirecting standard input from /dev/null (< /dev/null) also works (tested).
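As a rough check of the /dev/null variant outside a cluster, the loop from the question can be run against the sample data with a stand-in for srun. Everything here beyond the question's own loop is an assumption made for illustration: fake_srun is a hypothetical function that drains stdin and then runs its arguments, the work happens in a temporary directory, echo replaces ls so no data files are needed, and a pipe replaces the process substitution.

```shell
#!/bin/bash
# Hypothetical stand-in for srun so the sketch runs without SLURM:
# it drains its stdin (as srun does by default), then runs the command.
fake_srun() {
    cat > /dev/null
    "$@"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' 'file 1 BL' 'file 1 MB' 'file 2 BL' 'file 2 MB' 'file 3 MB' > internal.txt

for SM in MB BL
do
    grep "$SM" internal.txt | awk '{print $1 "_" $2 "_" $3 ".txt"}' |
    while read -r id
    do
        # stdin comes from /dev/null, so the stand-in cannot eat the loop's input
        fake_srun echo "processing $id" < /dev/null 1>&2
        echo "$id"
    done > "test_${SM}.dat"
done

cat test_BL.dat test_MB.dat
```

With the redirection in place, test_BL.dat ends up with two lines and test_MB.dat with three, as the question expects. Removing < /dev/null from the fake_srun line brings back the one-line files.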
