MPI result is different under Slurm and by using command

I ran into a problem when running my MPI program under Slurm.

a1 is my executable. When I run mpiexec -np 4 ./a1 directly it works fine, but when I run it under Slurm it does not work correctly; it looks like it stops partway through.

Here is the output from mpiexec -np 4 ./a1, which is correct:

Processor1 will send and receive with processor0
Processor3 will send and receive with processor0
Processor0 will send and receive with processor1
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor2 will send and receive with processor0
Processor1 will send and receive with processor2
Processor2 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor0 will send and receive with processor3
Processor0 finished send and receive with processor3
Processor3 finished send and receive with processor0
Processor1 finished send and receive with processor2
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor0: I am very good, I save the hash in range 0 to 65
p: 4
Tp: 8.61754
Processor1 will send and receive with processor3
Processor3 will send and receive with processor1
Processor3 finished send and receive with processor1
Processor1 finished send and receive with processor3
Processor2 will send and receive with processor3
Processor1: I am very good, I save the hash in range 65 to 130
Processor2 finished send and receive with processor3
Processor3 will send and receive with processor2
Processor3 finished send and receive with processor2
Processor3: I am very good, I save the hash in range 195 to 260
Processor2: I am very good, I save the hash in range 130 to 195

Here is the output under Slurm; it does not return the whole result the way the direct command does:

Processor0 will send and receive with processor1
Processor2 will send and receive with processor0
Processor3 will send and receive with processor0
Processor1 will send and receive with processor0
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor2 finished send and receive with processor0
Processor1 will send and receive with processor2
Processor0 will send and receive with processor3
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor2 will send and receive with processor3
Processor1 finished send and receive with processor2

Here is my slurm.sh file. I think I made some mistake in it that makes the result differ from the command-line run, but I am not sure about that... (a submission/diagnosis sketch follows the script):

#!/bin/bash

####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute

####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=64000

####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive

####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=12:00:00

mpiexec -np 4 ./a1
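
To see what Slurm actually ran, one option is to submit the script and then inspect the per-job stdout/stderr files that the --output/--error lines create. This is only a minimal sketch, assuming the script above is saved as slurm.sh (the file name is my own choice here):

#!/bin/bash
# Submit the job script and capture the job ID (file name is an assumption).
jobid=$(sbatch --parsable slurm.sh)

# Check the job's state in the queue.
squeue -j "$jobid"

# Once the job has finished: %j in --output/--error expands to the job ID,
# so the files are <jobid>.stdout and <jobid>.stderr.
cat "${jobid}.stdout"
cat "${jobid}.stderr"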

Coming back to solve my own problem: I made a silly mistake and used the wrong slurm.sh for my MPI code. The correct slurm.sh is below (a short note follows it):

#!/bin/bash

####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute

####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=32000

####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive

####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --time=01:00:00

####### check modules to see which version of MPI is available
####### and use appropriate module if needed
module load intel-mpi/2018.3
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

srun ./a1
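
One additional note (my own observation, not part of the original fix): with --nodes=4 and --ntasks-per-node=12, a plain srun ./a1 launches one rank per allocated task, i.e. 48 ranks here, not 4 as in the mpiexec test. If the intent is still to run exactly 4 ranks, the task count can be passed to srun explicitly; a sketch, assuming the same Intel MPI module and PMI library as above:

module load intel-mpi/2018.3
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

# Ask srun for exactly 4 tasks instead of one per allocated task slot.
srun -n 4 ./a1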

I am silly, which is why I picked the nickname Xiaonan... I hope I can get smarter.