slurmstepd: error: Exceeded step memory limit at some point
I am running code on Bluehive. The code has a parameter N. When N is small, the code runs fine, but for a slightly larger N (e.g. N=10), the code runs for a few hours and then I get the following error message:
slurmstepd: error: Exceeded step memory limit at some point.
The batch file I submit contains the following:
#!/bin/bash
#SBATCH -o log.%a.txt -t 3-01:01:00
#SBATCH --mem-per-cpu=1gb
#SBATCH -c 4
#SBATCH --gres=gpu:1
#SBATCH -J Ankani
#SBATCH -a 1-2
python run.py $SLURM_ARRAY_TASK_ID
I believe I am allocating enough memory for the code, but I still get the error
"slurmstepd: error: Exceeded step memory limit at some point."
Can anyone help?
I will note, however, that the memory limit described by "step memory limit" in
this error message is not necessarily related to your processes' RSS. The limit
is provided and enforced by the cgroup plugin, and memory cgroups track not
only the RSS of the tasks in your job but also file cache, mmap'd pages, etc.
If I had to guess, you are hitting the memory limit due to the page cache.
In that case, you might be able to just ignore this error since
hitting the limit here probably just triggered memory reclaim which
freed cached pages (this shouldn't be a fatal error).
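One way to check whether page cache is the culprit is to read the job's memory.stat from inside the job. A minimal sketch in Python, assuming a cgroup v1 layout under Slurm's usual /sys/fs/cgroup/memory/slurm hierarchy (the exact path varies by site and Slurm configuration):

```python
import os
from pathlib import Path

def job_memory_stat(job_id, keys=("rss", "cache"),
                    root="/sys/fs/cgroup/memory/slurm"):
    """Read selected counters (in bytes) from the job's memory cgroup.

    Assumes a cgroup v1 hierarchy of the form
    <root>/uid_<uid>/job_<job_id>/memory.stat; adjust for your site.
    "rss" is anonymous memory; "cache" is page cache, which also counts
    against the cgroup limit.
    """
    stat = Path(root) / f"uid_{os.getuid()}" / f"job_{job_id}" / "memory.stat"
    result = {}
    for line in stat.read_text().splitlines():
        key, _, value = line.partition(" ")
        if key in keys:
            result[key] = int(value)
    return result
```

Called as `job_memory_stat(os.environ["SLURM_JOB_ID"])` from inside the job, a `cache` value that is large relative to `rss` would support the page-cache explanation.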
If you'd like to avoid the error, and you're only writing out data and don't
want it cached, then you could try playing with posix_fadvise(2) using
POSIX_FADV_DONTNEED, which hints to the VM that you aren't going to read the
pages you're writing out again.
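Since the job runs a Python script, the same hint is available via os.posix_fadvise on Linux. A minimal sketch (the function name and the fsync-then-advise pattern are my own illustration, not code from the question):

```python
import os

def write_without_caching(path, data):
    """Write data to path, then hint the kernel (POSIX_FADV_DONTNEED) that
    the written pages won't be read again, so it may drop them from the
    page cache rather than letting them count against the job's limit."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # Dirty pages must be written back before the kernel can drop them.
        os.fsync(fd)
        os.posix_fadvise(fd, 0, len(data), os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Note that POSIX_FADV_DONTNEED is only a hint: the kernel is free to keep the pages, so this reduces cache pressure but does not guarantee the counter stays below the limit.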