"ORTE_ERROR_LOG: Not found in file dpm_orte.c at line 167" 导致使用 OpenMPI 的 Fortran 程序崩溃

"ORTE_ERROR_LOG: Not found in file dpm_orte.c at line 167" causes Fortran program utilizing OpenMPI to crash

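I am learning how to use OpenMPI with Fortran. Following the OpenMPI documentation, I tried to create a simple client/server program. However, when I run it, I get the following error from the client: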

[Laptop:13402] [[54220,1],0] ORTE_ERROR_LOG: Not found in file dpm_orte.c at line 167
[Laptop:13402] *** An error occurred in MPI_Comm_connect
[Laptop:13402] *** reported by process [3553361921,0]
[Laptop:13402] *** on communicator MPI_COMM_WORLD
[Laptop:13402] *** MPI_ERR_INTERN: internal error
[Laptop:13402] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Laptop:13402] ***    and potentially your MPI job)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[54220,1],0]
  Exit code:    17
--------------------------------------------------------------------------

The code for the server and the client is as follows:

server.f90

program name
use mpi
implicit none

    ! type declaration statements
    INTEGER :: ierr, size, newcomm, loop, buf(255), status(MPI_STATUS_SIZE)
    CHARACTER(MPI_MAX_PORT_NAME) :: port_name

    ! executable statements
    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
    call MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
    print *, "Port name is: ", port_name

    do while (.true.)
        call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, newcomm, ierr)

        loop = 1
        do while (loop .eq. 1)
            call MPI_Recv(buf, 255, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, newcomm, status, ierr)
            print *, "Looping the loop."
            loop = 0

        enddo

        call MPI_Comm_free(newcomm, ierr)
        call MPI_Close_port(port_name, ierr)
        call MPI_Finalize(ierr)    

    enddo

end program name

client.f90

program name
use mpi
implicit none

    ! type declaration statements
    INTEGER :: ierr, buf(255), tag, newcomm
    CHARACTER(MPI_MAX_PORT_NAME) :: port_name
    LOGICAL :: done

    ! executable statements
    call MPI_Init(ierr)
    print *, "Please provide me with the port name: "
    read(*,*) port_name

    call MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, newcomm, ierr)

    done = .false.
    do while (.not. done)
        tag = 0
        call MPI_Send(buf, 255, MPI_INTEGER, 0, tag, newcomm, ierr)
        done = .true.
    enddo

    call MPI_Send(buf, 0, MPI_INTEGER, 0, 1, newcomm, ierr)
    call MPI_Comm_Disconnect(newcomm, ierr)
    call MPI_Finalize(ierr)

end program name

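I compile the programs using mpif90 server.f90 -o server.out and mpif90 client.f90 -o client.out, and run them using mpiexec -np 1 server.out and mpiexec -np 1 client.out. The error occurs when the port name is provided to the client (i.e. when I press Enter after the read).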

which dpm_orte.c returns dpm_orte.c not found

I am running Linux and installed OpenMPI 1.10.3-1 from Arch Extra.

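This is a trivial Fortran input-handling error and has nothing to do with MPI (apart from the error message printed by Open MPI being completely unintelligible). Just insert a line in client.f90 that prints the value of port_name right after it is read: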

print *, "Please provide me with the port name: "
read(*,*) port_name
print *, port_name

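If the actual port name is something like 2527592448.0;tcp://10.0.1.6,10.0.1.2,192.168.122.1,10.10.11.10:55837+2527592449.0;tcp://10.0.1.6,10.0.1.4,192.168.122.1,10.10.11.10::300, the output will be 2527592448.0. List-directed input treats ; as a separator and stops reading there, so the port address passed to MPI_COMM_CONNECT is incomplete.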
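The truncation can be reproduced without MPI. The following is a minimal sketch that reads the same string twice from an internal file, once list-directed and once with an explicit (A) format. The program name, the variable names and the shortened port string are made up for illustration and are not real Open MPI output.

```fortran
program read_demo
    implicit none
    character(len=64) :: line, listed, whole

    ! Hypothetical, shortened port string; real port names are much longer.
    line = "2527592448.0;tcp://10.0.1.6:55837"

    ! List-directed read: an unquoted character value ends at the first
    ! value separator the compiler recognises.
    read(line, *) listed

    ! Explicit character edit descriptor: the record is taken as is,
    ! separators included.
    read(line, '(A)') whole

    print '(A)', "list-directed: " // trim(listed)
    print '(A)', "format (A):    " // trim(whole)
end program read_demo
```

The second line of output should show the full string. Where exactly the first one is cut off may depend on the compiler: blanks, commas and slashes are value separators as well, so even a compiler that does not stop at ; would likely not return the whole port name.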

The solution is to replace read(*,*) port_name with

read(*,'(A)') port_name

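Also, the loop in the server is badly written. You cannot call MPI_FINALIZE more than once. Closing the port inside the loop is also a bad idea, since you call MPI_COMM_ACCEPT on it again right afterwards. The correct loop is: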

! executable statements
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
call MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
print *, "Port name is: ", port_name

do while (.true.)
   call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, newcomm, ierr)

   loop = 1
   do while (loop .eq. 1)
      call MPI_Recv(buf, 255, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, newcomm, status, ierr)
      print *, "Looping the loop."
      loop = 0
   enddo

   ! MPI_Comm_disconnect frees the communicator and sets newcomm to
   ! MPI_COMM_NULL, so no additional MPI_Comm_free call is needed.
   call MPI_Comm_disconnect(newcomm, ierr)
enddo

call MPI_Close_port(port_name, ierr)
call MPI_Finalize(ierr)
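Because do while (.true.) never exits, the MPI_Close_port and MPI_Finalize calls after the loop are only reached if the loop is given an exit condition. The following is a minimal sketch of one way to do that, under two assumptions that are not in the original: the server stops after a fixed number of clients (max_clients is a hypothetical limit), and the zero-length message with tag 1 that client.f90 sends last is treated as an end-of-session marker.

```fortran
program server_sketch
    use mpi
    implicit none

    integer, parameter :: max_clients = 3   ! hypothetical limit, for illustration only
    integer :: ierr, newcomm, served, buf(255), status(MPI_STATUS_SIZE)
    character(MPI_MAX_PORT_NAME) :: port_name

    call MPI_Init(ierr)
    call MPI_Open_port(MPI_INFO_NULL, port_name, ierr)
    print '(2A)', "Port name is: ", trim(port_name)

    do served = 1, max_clients
        call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, newcomm, ierr)

        ! Receive until the client's closing message arrives
        ! (assumed here to be the one sent with tag 1).
        do
            call MPI_Recv(buf, 255, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                          newcomm, status, ierr)
            if (status(MPI_TAG) == 1) exit
            print *, "Received a data message."
        end do

        ! Disconnect releases the communicator; newcomm becomes MPI_COMM_NULL.
        call MPI_Comm_disconnect(newcomm, ierr)
    end do

    ! Reached once, after the last client has been served.
    call MPI_Close_port(port_name, ierr)
    call MPI_Finalize(ierr)
end program server_sketch
```

With this structure the port stays open for all clients, and each of MPI_Close_port and MPI_Finalize is called exactly once.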