Getting Fatal error when calling MPI_Reduce inside a loop
I'm having a problem with this part of my code (it is shared among the tasks):
for (i = 0; i < m; i++) {
// some code
MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
// some code
}
This works fine, but for larger values of m I get this error:
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce(764)..........................:
MPIR_Reduce_binomial(207).................:
MPIC_Send(41).............................:
MPIC_Wait(513)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(436):
MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
job aborted:
rank: node: exit code[: error message]
0: AmirDiab: 1
1: AmirDiab: 1
2: AmirDiab: 1: Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce(764)..........................:
MPIR_Reduce_binomial(207).................:
MPIC_Send(41).............................:
MPIC_Wait(513)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(436):
MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
3: AmirDiab: 1
Any suggestions?
Your communication pattern appears to be overwhelming MPI. Note the 261895 unexpected messages queued in the error message. That is a lot of messages. Because MPI sends data eagerly for small messages (such as your single-element reductions), running hundreds of thousands of MPI_Reduce calls in a loop can exhaust resources when too many messages are in flight at once.
If possible, try rearranging your algorithm so that all m elements are handled in a single reduction instead of iterating over them:
int* res = malloc(m * sizeof(int));
int* ms = malloc(m * sizeof(int));
for (i = 0; i < m; ++i) {
    ms[i] = /* ... */;
}
/* note the argument order: send buffer first, receive buffer second */
MPI_Reduce(ms, res, m, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
Alternatively, as suggested in the comments, you can add an MPI_Barrier() call every so often inside the loop to throttle the number of outstanding messages.
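A minimal sketch of that throttling approach might look like the following; the interval of 1000 iterations and the placeholder local value are arbitrary choices for illustration, not taken from your code:

```c
#include <mpi.h>

#define BARRIER_INTERVAL 1000  /* arbitrary; tune for your workload */

int main(int argc, char** argv) {
    int i, m = 100000, res, mn;
    MPI_Init(&argc, &argv);
    for (i = 0; i < m; i++) {
        res = i;  /* placeholder for the per-iteration local value */
        MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
        /* Every BARRIER_INTERVAL iterations, synchronize all ranks so
           eagerly sent messages can drain before more are queued. */
        if ((i + 1) % BARRIER_INTERVAL == 0) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```

The barrier does not change the reduction results; it only forces slower ranks to catch up periodically, bounding how many unexpected messages can pile up at any one rank.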