Getting Fatal error when calling MPI_Reduce inside a loop

I have this section of code, which every task (rank) executes:

for (i = 0; i < m; i++) {
    // some code
    MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    // some code
}

This works fine, but for larger values of m I get this error:

    Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    
    job aborted:
    rank: node: exit code[: error message]
    0: AmirDiab: 1
    1: AmirDiab: 1
    2: AmirDiab: 1: Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    3: AmirDiab: 1

Any suggestions?

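Your communication pattern appears to be overloading MPI. Note the "261895 unexpected messages queued" part of the error message. That is quite a lot of messages. An unexpected message is one that arrives before the receiver has posted a matching receive, so the library has to buffer it. MPI tries to send small messages (such as your single-element reduction) eagerly, and MPI_Reduce does not force the ranks to synchronize, so some ranks can run far ahead of others. Running hundreds of thousands of MPI_Reduce calls in a loop can therefore exhaust resources once too many messages are in flight.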

If possible, try to rearrange your algorithm so that all m elements are processed in a single reduction instead of iterating over them:

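int* res = malloc(m * sizeof(int));  /* local values (send buffer) */
int* ms  = malloc(m * sizeof(int));  /* element-wise minima (receive buffer, only significant on rank 0) */

/* Fill the send buffer with this rank's m local values. */
for (i = 0; i < m; ++i) {
    res[i] = /* ... */;
}

/* One reduction over all m elements instead of m one-element reductions. */
MPI_Reduce(res, ms, m, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);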

Alternatively, as suggested in the comments, you can add an MPI_Barrier() call every so often inside the loop to limit the number of outstanding messages.