Using gdb backtrace to debug MPI code

Running the program under gdb and asking for a backtrace gives the following output:

[Thread debugging using libthread_db enabled]
[New Thread 0x2aaaaffd3700 (LWP 32109)]
[Thread 0x2aaaaffd3700 (LWP 32109) exited]
Detaching after fork from child process 32110.
Detaching after fork from child process 32111.
Detaching after fork from child process 32112.
Detaching after fork from child process 32113.
Detaching after fork from child process 32114.
Detaching after fork from child process 32115.
Detaching after fork from child process 32116.
Detaching after fork from child process 32117.
Detaching after fork from child process 32118.
Detaching after fork from child process 32119.
Detaching after fork from child process 32120.
Detaching after fork from child process 32121.
Detaching after fork from child process 32122.
Detaching after fork from child process 32123.
Detaching after fork from child process 32124.
Detaching after fork from child process 32125.
Detaching after fork from child process 32126.
Detaching after fork from child process 32127.
Detaching after fork from child process 32128.
Detaching after fork from child process 32129.
Detaching after fork from child process 32130.
Missing separate debuginfos, use: debuginfo-install     fftw-3.2.1-3.1.el6.x86_64 glibc-2.12-1.80.el6_3.5.x86_64 nss-pam-ldapd-0.7.5-14.el6_2.1.x86_64
Detaching after fork from child process 32131.
Detaching after fork from child process 32133.
Detaching after fork from child process 32134.
Detaching after fork from child process 32135.
Detaching after fork from child process 32136.
Detaching after fork from child process 32137.
Detaching after fork from child process 32138.
Detaching after fork from child process 32139.
Detaching after fork from child process 32140.
Detaching after fork from child process 32141.
Detaching after fork from child process 32142.
Detaching after fork from child process 32143.
Detaching after fork from child process 32144.

Program received signal SIGFPE, Arithmetic exception.

0x00000000004a3104 in phase::Mobility::Average ()
#0  0x00000000004a3104 in phase::Mobility::Average ()
#1  0x00000000004a3523 in phase::Mobility::Average(phase::Field&, phase::BoundaryConditions&) ()
#2  0x000000000046fcda in phase::Diffusion::CalculateMobility(phase::Field&, phase::Composition&, phase::BoundaryConditions&, phase::Mobility&) ()
#3  0x0000000000441a3e in MyParallelism<MyParallelBlock>::Run() ()
#4  0x00000000004436dc in main ()

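What does the order of the functions in the output mean? Should I be looking at the last function in the output? How can I narrow this down further to the line that causes the arithmetic exception?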

Edit: building with the -g option gives,

Program received signal SIGFPE, Arithmetic exception.
0x00000000004a5fa4 in phase::Mobility::Average ()
#0  0x00000000004a5fa4 in phase::Mobility::Average ()
#1  0x00000000004a63c3 in phase::Mobility::Average(phase::Field&, phase::BoundaryConditions&) ()
#2  0x0000000000472fea in phase::Diffusion::Mobility(phase::Field&, phase::Composition&, phase::BoundaryConditions&, phase::Mobility&) ()
#3  0x000000000042686e in MyParallelBlock::DoTimestep (this=0x7c9368)
    at Parallelism.cpp:100
#4  0x00000000004450d9 in MyParallelism<MyParallelBlock>::Run (
    this=0x7fffffffd2f0) at Parallelism.cpp:164
#5  0x0000000000446ad3 in main (argc=1, argv=0x7fffffffdcd8)
    at Parallelism.cpp:242

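But this does not narrow down the cause of the arithmetic exception. It only adds the information that the exception occurs inside the Run loop, which was already known. I was expecting more information about what happens inside the function phase::Mobility::Average (). What do numbers such as 0x0000000000446ad3 and 0x00000000004450d9 mean? Can I get any information out of them?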

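The gdb stack trace lists the functions from top to bottom in the order in which they sit on the call stack, innermost frame first (whereas the stack itself grows from bottom to top). In your trace, main is the outermost caller and appears last.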

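If gdb catches a fault such as an arithmetic exception or a segmentation fault, the function in which the fault occurred is shown at position #0 of gdb's stack trace.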

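To get the file and line where the error occurred, recompile your program with debug symbols. Use the compiler's -g flag to do this. Make sure you recompile at least the files that declare and implement the failing function (see #0 in the stack trace).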

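In your case, you have to recompile the files that implement the class/namespace phase::Mobility with the -g option. Your second trace shows file and line only for the frames in Parallelism.cpp, which suggests that only that file was built with -g.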
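As a rough illustration of what to look for once frame #0 has line information, here is a minimal sketch of code that can raise SIGFPE, next to a guarded variant. Everything in it is hypothetical: the namespace demo and the functions average_unchecked, average_checked and usage_example are made up for this example and are not taken from your program. It also assumes the exception comes from an integer division by zero, which is a common cause on x86-64 Linux but is not something the trace confirms. The hexadecimal values in the trace are the code addresses of each frame; without debug information for the file that contains them, gdb cannot map them to a source line.

```cpp
// Hypothetical stand-in for the failing code; none of these names come from
// the original program.
#include <cassert>
#include <vector>

namespace demo {

// Unchecked integer average. With count == 0 the division is undefined
// behaviour in C++; on x86-64 Linux it typically traps and the process
// receives SIGFPE, so gdb would report this function as frame #0.
inline long average_unchecked(const std::vector<long>& values, long count) {
    long sum = 0;
    for (long v : values) {
        sum += v;
    }
    return sum / count;
}

// Checked variant: returns false instead of dividing when count is zero,
// so the caller can report which input was bad.
inline bool average_checked(const std::vector<long>& values, long count,
                            long& result) {
    if (count == 0) {
        return false;
    }
    result = average_unchecked(values, count);
    return true;
}

// Small usage example; call it from main() to exercise both paths.
inline void usage_example() {
    long result = 0;
    assert(average_checked({2, 4, 6}, 3, result) && result == 4);
    assert(!average_checked({2, 4, 6}, 0, result));
}

}  // namespace demo
```

If the file containing the unchecked function is compiled with -g, frame #0 of the backtrace should name the source line of the division, and a check like the one in average_checked can then be placed right before that line to find out which divisor was zero.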