What does it mean for a Slurm job to crash with `bus error`?

When I run a Python script through Slurm in an `srun --pty bash` session, I get a cryptic error message: `Bus error: core dumped`

I searched the Slurm documentation, and it does not mention this type of error.

What is going on, and how can I fix it?

I found this general information on bus errors, but it does not explain how and why they occur in a Slurm environment, or what can be done to avoid them: What is a bus error? Is it different from a segmentation fault?

In at least one case, the cause appears to have been that my job needed too much memory and was therefore killed by Slurm.

When I ran the same job that had crashed with a bus error directly on a worker node, it was killed after claiming more than 30 GB.
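To check whether memory is the likely culprit, it can help to measure how much the script actually uses. The following is a minimal sketch, not part of the original post: `peak_rss_mib` and `suggested_mem_flag` are hypothetical helper names, and the 20% headroom is an arbitrary assumption. It reads the peak resident set size of the current process and formats it as a value for Slurm's `--mem` flag. A job that is killed never reaches the reporting step, so this is best tried on a smaller input first.

```python
import math
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def suggested_mem_flag(headroom: float = 1.2) -> str:
    """Format the peak usage plus some headroom as a Slurm --mem value.

    The headroom factor is an assumption for illustration, not a Slurm rule.
    """
    mib = max(1, math.ceil(peak_rss_mib() * headroom))
    return f"--mem={mib}M"


if __name__ == "__main__":
    # Call this at the end of the real workload to see what it needed.
    print(f"peak RSS: {peak_rss_mib():.1f} MiB")
    print(f"suggested flag: {suggested_mem_flag()}")
```

The printed flag could then be added to the `srun` or `sbatch` command line, scaled up for the full-size input.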

A helpful answer by Ben Evans on the Yale cluster Discourse may apply more generally to other clusters:

On the Yale clusters, a bus error usually means your job ran out of memory (RAM). If you cannot reduce the memory usage of your code, you can request additional memory for your job using the --mem-per-cpu or --mem Slurm flags.

More details: Your program can run into this fault because of the way we manage memory with cgroups so that many jobs can be run on the same physical machine without interfering with one another. If a process inside a job tries to access memory “outside” what was allocated to that job, e.g. more than what you requested, the operating system tells your program that address is invalid with the fault Bus Error, aka SIGBUS, exit(10). A similar fault you might be more familiar with is a Segmentation Fault, aka SIGSEGV, exit(11) which usually results from a program incorrectly trying to access a valid memory address.

https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101/2
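To tell a bus error apart from a segmentation fault or an ordinary failure, one option is to launch the script from a small wrapper and decode its exit status. This is only a sketch under stated assumptions: `describe_exit` and `run_and_describe` are hypothetical helpers, not Slurm or cluster tooling. It looks signal names up through Python's `signal` module instead of hard-coding numbers, because the number assigned to SIGBUS differs between platforms.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short description."""
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    if returncode > 128:
        # Shells conventionally report death by signal as 128 + signal number.
        try:
            name = signal.Signals(returncode - 128).name
            return f"killed by {name} (shell convention)"
        except ValueError:
            pass  # Not a known signal number; treat as a plain exit status.
    return f"exited with status {returncode}"


def run_and_describe(argv) -> str:
    """Run a command and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Usage: python wrapper.py python my_script.py
    print(run_and_describe(sys.argv[1:]))
```

If the wrapper reports `killed by SIGBUS` for a job that runs fine with a larger `--mem` or `--mem-per-cpu` request, that is consistent with the out-of-memory explanation quoted above.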