How to enable CUDA-aware OpenMPI?

I am using OpenMPI and need to enable CUDA-aware MPI. Together with MPI, I am using OpenACC from the NVIDIA HPC SDK (hpc_sdk).

Following https://www.open-mpi.org/faq/?category=buildcuda, I downloaded and installed UCX (but not gdrcopy, which I have not managed to install yet), configuring it with:

./contrib/configure-release --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-fortran

This prints:

checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes

So UCX seems to be fine. After this, I reconfigured OpenMPI:

./configure --with-ucx=/home/marco/Downloads/ucx-1.9.0/install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-mpi-fortran

This prints:

CUDA support: yes
Open UCX: yes
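The configure summary only reports how the library was built. As a cross-check, the installed Open MPI can be queried with ompi_info, as described in the Open MPI FAQ linked above. A minimal sketch, assuming the ompi_info of the freshly built Open MPI is the first one in PATH; the helper name cuda_support_value is hypothetical:

```shell
#!/usr/bin/env bash
# Extract the value of mpi_built_with_cuda_support from the output of
# `ompi_info --parsable --all` (read from stdin).
# cuda_support_value is a hypothetical helper name, not part of Open MPI.
cuda_support_value() {
  # Matching lines look like:
  #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
  grep 'mpi_built_with_cuda_support:value' | head -n 1 | awk -F: '{print $NF}'
}

if command -v ompi_info >/dev/null 2>&1; then
  value=$(ompi_info --parsable --all | cuda_support_value)
  echo "CUDA-aware build: ${value:-unknown}"
else
  echo "ompi_info not found in PATH"
fi
```

If this reports false or unknown while configure reported "CUDA support: yes", the ompi_info being run probably belongs to a different installation than the one just built.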

If I try to run the application with mpirun -np 2 -mca pml ucx -x ./a.out (as suggested on openucx.org), I get the error:

match_arg (utils/args/args.c:163): unrecognized argument mca
HYDU_parse_array (utils/args/args.c:178): argument matching returned error
parse_args (ui/mpich/utils.c:1642): error parsing input array
HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
main (ui/mpich/mpiexec.c:148): error parsing parameters

I see that the directory being searched is MPICH's rather than OpenMPI's, and I don't know why. If I type which mpicc, which mpiexec, and which mpirun, I get the OpenMPI ones.
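The HYDU_ and ui/mpich entries in the error above come from Hydra, the process launcher used by MPICH, which suggests that the mpirun that actually ran was not Open MPI's. One way to check is to ask each launcher what it is. A minimal sketch, assuming a bash shell; classify_mpi_launcher is a hypothetical helper, and the matched strings are what Open MPI and Hydra launchers typically print for --version:

```shell
#!/usr/bin/env bash
# Classify an MPI launcher from the text it prints for `--version`.
# classify_mpi_launcher is a hypothetical helper name.
classify_mpi_launcher() {
  case "$1" in
    *"Open MPI"*)    echo "openmpi" ;;
    *HYDRA*|*Hydra*) echo "mpich-hydra" ;;
    *)               echo "unknown" ;;
  esac
}

# List every mpirun/mpiexec visible in PATH, in lookup order.
type -a mpirun mpiexec 2>/dev/null

# Report which implementation each launcher found in PATH belongs to.
for launcher in mpirun mpiexec; do
  if command -v "$launcher" >/dev/null 2>&1; then
    echo "$launcher: $(classify_mpi_launcher "$("$launcher" --version 2>&1)")"
  fi
done
```

If PATH was changed earlier in the same shell session, hash -r clears bash's cached command locations so that the new lookup order takes effect.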

If I run with mpiexec -n 2 ./a.out, I get:

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Edit:

Doing the same thing with the OpenMPI-4.0.5 that ships with the NVIDIA HPC SDK, the program compiles and launches fine, but then I get:

[marco-Inspiron-7501:1356251:0:1356251] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f05cfafa000)
==== backtrace (tid:1356251) ====
 0  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7f060ae06dc7]
 1  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7f060ae06b87]
 2  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7f060ae06ce4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f060c7433c0]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7f060befb885]
 5  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7f060b2bd9e6]
 6  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7f060b2bd775]
 7  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7f060b2d35b5]
 8  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7f060b2d111d]
 9  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7f060b055577]
10  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7f060b054725]
11  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7f060b2d3614]
12  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7f060b2d22c7]
13  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7f060b2d15b1]
14  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7f060b2e85bd]
15  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7f060b2e7d15]
16  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7f060b2e721a]
17  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7f060b2e65ac]
18  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7f060dfc3b33]
=================================
[marco-Inspiron-7501:1356252:0:1356252] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd7f7afa000)
==== backtrace (tid:1356252) ====
 0  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7fd82a711dc7]
 1  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7fd82a711b87]
 2  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7fd82a711ce4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd82c04e3c0]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7fd82b806885]
 5  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7fd82abc89e6]
 6  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7fd82abc8775]
 7  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7fd82abde5b5]
 8  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7fd82abdc11d]
 9  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7fd82a960577]
10  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7fd82a95f725]
11  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7fd82abde614]
12  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7fd82abdd2c7]
13  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7fd82abdc5b1]
14  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7fd82abf35bd]
15  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7fd82abf2d15]
16  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7fd82abf221a]
17  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7fd82abf15ac]
18  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7fd82d8ceb33]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node marco-Inspiron-7501 exited on signal 11 (Segmentation fault).

The error is caused by #pragma acc host_data use_device(send_buf, recv_buf):

  double send_buf[NX_GLOB + 2*NGHOST];
  double recv_buf[NX_GLOB + 2*NGHOST];

  #pragma acc enter data create(send_buf[:NX_GLOB+2*NGHOST], recv_buf[:NX_GLOB+2*NGHOST])

  // Top buffer
  j = jend;
  #pragma acc parallel loop present(phi[:ny_tot][:nx_tot], send_buf[:NX_GLOB+2*NGHOST])
  for (i = ibeg; i <= iend; i++) send_buf[i] = phi[j][i];
  #pragma acc host_data use_device(send_buf, recv_buf)
  {
  MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

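This is an issue that appeared in the 20.7 release when UCX support was added. You can work around it by lowering the optimization level to -O1, or by updating your NV HPC compiler to version 20.9, where we have resolved the issue.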

https://developer.nvidia.com/nvidia-hpc-sdk-version-209-downloads
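As a sketch of the -O1 workaround, the optimization flag could be chosen from the compiler release. This assumes the affected releases are 20.7 up to but not including 20.9, as stated above, and that sort -V (GNU coreutils) is available. The helper needs_o1_workaround, the -O2 fallback, and the file name halo.c are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Succeed (return 0) when the given HPC SDK release lies in [20.7, 20.9).
# needs_o1_workaround is a hypothetical helper name.
needs_o1_workaround() {
  local ver="$1"
  # ver >= 20.7: 20.7 must sort first (or be equal)
  [ "$(printf '%s\n' "20.7" "$ver" | sort -V | head -n 1)" = "20.7" ] || return 1
  # ver < 20.9: ver must differ from 20.9 and sort before it
  [ "$ver" != "20.9" ] && [ "$(printf '%s\n' "$ver" "20.9" | sort -V | head -n 1)" = "$ver" ]
}

sdk_version="20.7"   # assumption: set this to the installed release
if needs_o1_workaround "$sdk_version"; then
  opt_flag="-O1"
else
  opt_flag="-O2"     # assumption: the optimization level normally used
fi

# Print the compile command instead of running it.
echo "mpicc -acc $opt_flag halo.c -o a.out"
```

Here mpicc is assumed to be the wrapper around pgcc from the same HPC SDK installation, so that -acc is passed through to the OpenACC compiler.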