How to use GNU parallel (bash scripting) with the aprun command on Cray XE6 compute nodes (Unix-like env)?

Question

I am trying to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands of this sort in s.txt:

python /lustre/4_mpi4py/hello.py > 01.out

I submit this on a Cray cluster via the aprun command:

aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'

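My intention was to run 8 of those Python jobs per node at a time. The script ran for more than 3 hours and none of the *.out files was created. From the PBS scheduler output file I get this: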

Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out

I am running this on a single node, which has 32 cores. I suppose my use of the GNU parallel command is wrong. Can someone please help?

Answer 1

As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8 :

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Three equivalent ways to read this process's rank and the total process count.
print("Hello! I'm rank %02d from %02d" % (comm.rank, comm.size))

print("Hello! I'm rank %02d from %02d" % (comm.Get_rank(), comm.Get_size()))

print("Hello! I'm rank %02d from %02d" %
      (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()))

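Your 4_mpi4py/hello.py program is not a typical single-process program (or a single Python script), but a multi-process MPI application.

GNU parallel expects simpler programs and does not support interaction with MPI processes.

Your cluster has many nodes, and each node may start a different number of MPI processes (with two 8-core CPUs per node, consider the variants: 2 MPI processes with 8 OpenMP threads each; 1 MPI process with 16 threads; 16 MPI processes without threads). To describe the slice of the cluster given to your task, there is an interface between the cluster management software and the MPI library used by the Python MPI wrapper that your script uses. And the management layer is aprun (and qsub?):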

http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/

https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/

You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System

The job launcher for the XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have nodes allocated by the batch system before (qsub).

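There is an interface between aprun/qsub and MPI: in a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun just starts several (32) processes of your MPI program, sets the id of each process in the interface, and gives them the group id (for example, through environment variables like PMI_ID; the actual variables are specific to the launcher/MPI lib combination).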
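To see which variables the launcher actually hands to a process on a given system, something like the following could be run under aprun. It is only a sketch: the prefixes PMI and ALPS are assumptions (the answer names PMI_ID merely as an example), and the helper name launcher_variables is hypothetical.

```python
import os

# Assumed prefixes; the real variable names depend on the launcher/MPI lib
# combination, so adjust these after looking at the full environment.
ASSUMED_PREFIXES = ("PMI", "ALPS")


def launcher_variables(environ, prefixes=ASSUMED_PREFIXES):
    """Return the environment entries whose names start with one of the prefixes."""
    return {name: value for name, value in environ.items()
            if name.startswith(prefixes)}


if __name__ == "__main__":
    found = launcher_variables(os.environ)
    if not found:
        print("no matching variables (probably not started by an MPI launcher)")
    for name in sorted(found):
        print("%s=%s" % (name, found[name]))
```

Running the same script once directly and once through the launcher should make the difference in the environment visible.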

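GNU parallel has no interface to MPI programs and knows nothing about these variables. It will just start 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group id, and there will be 8 processes with the same MPI process id. They will make your MPI library misbehave.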
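The arithmetic can be checked with a toy model. This is not a measurement of the real system; it only assumes that every job forked by GNU parallel inherits the environment, and therefore the rank id, of the shell that aprun started. The function name simulate_launch is hypothetical.

```python
from collections import Counter


def simulate_launch(aprun_ranks, parallel_slots):
    """Return the rank id inherited by every process that ends up running.

    Each aprun-started shell gets one rank id; every job forked under that
    shell by GNU parallel is assumed to inherit the same id.
    """
    return [rank for rank in range(aprun_ranks) for _ in range(parallel_slots)]


inherited = simulate_launch(aprun_ranks=32, parallel_slots=8)
copies_per_rank = Counter(inherited)

print("processes started:", len(inherited))        # 256
print("distinct rank ids:", len(copies_per_rank))  # 32
print("copies of rank 0:", copies_per_rank[0])     # 8
```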

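Never mix MPI resource managers/launchers with ancient pre-MPI Unix process forkers like xargs or parallel or "very-advanced bash scripting for parallelism". There is MPI for doing something in parallel, and there is the MPI launcher/job management (aprun, mpirun, mpiexec) for starting several processes, forking, or ssh-ing to machines.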

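Don't do aprun -n 32 sh -c 'parallel anything_with_MPI' - this is an unsupported combination. The only possible (allowed) argument to aprun is a program that supports parallelism, such as OpenMP, MPI, or MPI+OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).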

If you have several independent MPI tasks to start, pass multiple colon-separated segments to aprun: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4

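If you have multiple files to process, try starting several parallel jobs: use not a single qsub but several, and let PBS (or whichever job manager is used) manage your jobs.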
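A minimal sketch of the one-job-per-task idea: generate one PBS script per run, each with its own aprun line. The names JOB_TEMPLATE and make_job_script, the rank count, the walltime, and the mppwidth resource request are all assumptions for illustration; the directives a site accepts vary, so check the local batch documentation. The sketch only prints the script text and does not call qsub.

```python
# Assumed PBS directives for illustration; adjust to the site's batch system.
JOB_TEMPLATE = """#!/bin/bash
#PBS -N {name}
#PBS -l mppwidth={ranks}
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
aprun -n {ranks} python {script} > {output}
"""


def make_job_script(index, script, ranks=8, walltime="00:30:00"):
    """Build the text of one job script; nothing is written or submitted."""
    return JOB_TEMPLATE.format(
        name="hello_%02d" % index,
        ranks=ranks,
        walltime=walltime,
        script=script,
        output="%02d.out" % index,
    )


if __name__ == "__main__":
    # Print the first of the 16 scripts as an example.
    print(make_job_script(1, "/lustre/4_mpi4py/hello.py"))
```

Each generated script would then be saved to its own file and submitted with a separate qsub call, so the scheduler decides when and where every run starts.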

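If you have a very large number of files, try not to use MPI in your program (never link MPI libs or include MPI headers) and use parallel or some other form of ancient parallelism, hidden from aprun. Or use a single MPI program and program the file distribution directly in your code (the master process of MPI may open the file list, then distribute the files among the other MPI processes, with or without the dynamic process management of MPI / mpi4py: http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management).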
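The single-MPI-program option might look roughly like this. It is a sketch under stated assumptions: the file names, process_file, and split_round_robin are hypothetical placeholders, the split is a plain round-robin done once by rank 0 (no dynamic process management), and the ImportError fallback exists only so the sketch also runs on a machine without mpi4py.

```python
def split_round_robin(files, size):
    """Split files into `size` chunks; chunk r holds files r, r+size, ..."""
    return [files[r::size] for r in range(size)]


def process_file(path):
    # Hypothetical placeholder for the real per-file work.
    return "processed %s" % path


def main():
    # Hypothetical input names; a real run might read them from a list file.
    files = ["input_%02d.dat" % i for i in range(16)]
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        # Only the master (rank 0) builds the chunks; scatter hands one to each rank.
        chunks = split_round_robin(files, size) if rank == 0 else None
        my_files = comm.scatter(chunks, root=0)
    except ImportError:
        # Fallback without mpi4py: behave like a single rank.
        rank, size, my_files = 0, 1, files
    for path in my_files:
        print("rank %02d of %02d: %s" % (rank, size, process_file(path)))


if __name__ == "__main__":
    main()
```

Saved under a name such as distribute.py (hypothetical), it would be started once through the launcher, for example aprun -n 8 python distribute.py, so that the launcher rather than GNU parallel decides how many processes exist.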

Some scientists try to combine MPI and parallel in the other order: parallel ... aprun ... or parallel ... mpirun ...:

https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html#gnu-parallel
http://www.hpc.lsu.edu/training/weekly-materials/2017-Spring/gnuparallel-Feb2017.pdf#page=41
There is also a variant of parallel intended for Cray systems: https://github.com/levinas/cray-parallel

