GDB: debugging a core dump with missing symbol tables for a specific call stack

Question

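I'm hitting a strange crash and I don't know how to debug the core dump, because for some reason the call stack is missing symbol information for every frame except the last function called (frame #0):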

#0  BIH::intersectRay<VMAP::MapRayCallback> (this=0x7f47b8339608, r=..., intersectCallback=..., maxDist=@0x7f493af8383c: 0, stopAtFirst=true, los=<optimized out>) at ../BIH.h:223
#1  0x000000307ff00000 in ?? ()
#2  0x7ff0000000000000 in ?? ()
#3  0x0000000000000030 in ?? ()
#4  0x000000307ff00000 in ?? ()
#5  0x7ff0000000000000 in ?? ()
#6  0x0000000000000030 in ?? ()
#7  0x000000307ff00000 in ?? ()
#8  0x7ff0000000000000 in ?? ()
#9  0x0000000000000030 in ?? ()
#10 0x000000307ff00000 in ?? ()
#11 0x7ff0000000000000 in ?? ()
#12 0x0000000000000030 in ?? ()
#13 0x000000307ff00000 in ?? ()
#14 0x7ff0000000000000 in ?? ()
#15 0x0000000000000030 in ?? ()
#16 0x000000307ff00000 in ?? ()
#17 0x7ff0000000000000 in ?? ()
#18 0x0000000000000030 in ?? ()
#19 0x000000307ff00000 in ?? ()
#20 0x7ff0000000000000 in ?? ()
#21 0x0000000000000030 in ?? ()
#22 0x000000307ff00000 in ?? ()
....
#749 0x7ff0000000000000 in ?? ()
#750 0x0000000000000030 in ?? ()
#751 0x000000307ff00000 in ?? ()
#752 0x7ff0000000000000 in ?? ()
#753 0x0000000000000030 in ?? ()
#754 0x000000307ff00000 in ?? ()
#755 0x7ff0000000000000 in ?? ()
#756 0x0000000000000030 in ?? ()
#757 0x000000307ff00000 in ?? ()
#758 0x7ff0000000000000 in ?? ()
#759 0x0000000000000030 in ?? ()
#760 0x000000307ff00000 in ?? ()
#761 0x7ff0000000000000 in ?? ()
#762 0x0000000000000030 in ?? ()
#763 0x000000307ff00000 in ?? ()
#764 0x03010102464c457f in ?? ()
#765 0x0000000000000000 in ?? ()


(gdb) info frame 0
Stack frame at 0x7f493af83830:
 rip = 0x930f0b in BIH::intersectRay<VMAP::MapRayCallback> (../BIH.h:223); saved rip = 0x307ff00000
 called by frame at 0x7f493af83838
 source language c++.
 Arglist at 0x7f493af83438, args: this=0x7f47b8339608, r=..., intersectCallback=..., maxDist=@0x7f493af8383c: 0, stopAtFirst=true, los=<optimized out>
 Locals at 0x7f493af83438, Previous frame's sp is 0x7f493af83830
 Saved registers:
  rbx at 0x7f493af837f8, rbp at 0x7f493af83800, r12 at 0x7f493af83808, r13 at 0x7f493af83810, r14 at 0x7f493af83818, r15 at 0x7f493af83820, rip at 0x7f493af83828

#1  0x000000307ff00000 in ?? ()
No symbol table info available.
(gdb) info frame 1
Stack frame at 0x7f493af83838:
 rip = 0x307ff00000; saved rip = 0x7ff0000000000000
 called by frame at 0x7f493af83840, caller of frame at 0x7f493af83830
 Arglist at 0x7f493af83828, args:
 Locals at 0x7f493af83828, Previous frame's sp is 0x7f493af83838
 Saved registers:
  rip at 0x7f493af83830

#2  0x7ff0000000000000 in ?? ()
No symbol table info available.
(gdb) info frame 2
Stack frame at 0x7f493af83840:
 rip = 0x7ff0000000000000; saved rip = 0x30
 called by frame at 0x7f493af83848, caller of frame at 0x7f493af83838
 Arglist at 0x7f493af83830, args:
 Locals at 0x7f493af83830, Previous frame's sp is 0x7f493af83840
 Saved registers:
  rip at 0x7f493af83838

#3  0x0000000000000030 in ?? ()
No symbol table info available.
(gdb) info frame 3
Stack frame at 0x7f493af83848:
 rip = 0x30; saved rip = 0x307ff00000
 called by frame at 0x7f493af83850, caller of frame at 0x7f493af83840
 Arglist at 0x7f493af83838, args:
 Locals at 0x7f493af83838, Previous frame's sp is 0x7f493af83848
 Saved registers:
  rip at 0x7f493af83840

#4  0x000000307ff00000 in ?? ()
No symbol table info available.
(gdb) info frame 4
Stack frame at 0x7f493af83850:
 rip = 0x307ff00000; saved rip = 0x7ff0000000000000
 called by frame at 0x7f493af83858, caller of frame at 0x7f493af83848
 Arglist at 0x7f493af83840, args:
 Locals at 0x7f493af83840, Previous frame's sp is 0x7f493af83850
 Saved registers:
  rip at 0x7f493af83848

The code was compiled with -g -fvar-tracking -O2 -march=native.

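I have dumps from a variety of other crashes, and in all of them the symbol tables work and give a meaningful call stack, but for some reason this particular crash is a mystery.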

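One thing I noticed is that the same addresses repeat over and over. Could this be some kind of infinite loop, or a recursion that is corrupting or overflowing the stack?
If so, is there any way to get at the topmost function in the call stack (for example, a way to get past frame #765, or to find the function that was called before the overflow was triggered)?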

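I can't set $sp or jump to an address, because I can't debug and single-step a live program; I can only analyze the core dump.
I can't reproduce this crash; it happens from time to time in production. valgrind is also out of the question.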

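Are there any g++ compiler options or gdb flags that would help me with this?
Any pointers on how to debug this kind of problem (if it is possible at all) would be appreciated.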

Answer 1

"I have no idea how to debug the core dump since the call stack is missing symbols info for some reason"

Part 1:

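The most common cause of a nonsensical call stack like this is a mismatch between the binary that produced the core dump and the binary you are using to analyze the core.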

If you linked with --build-id, or if your GCC is configured to pass that linker flag by default, you can verify whether the binary matches the core with the following procedure:

readelf -n /path/to/binary

This should produce output similar to the following:

$ readelf -n /bin/sleep

Displaying notes found at file offset 0x00000254 with length 0x00000020:
  Owner                 Data size   Description
  GNU                  0x00000010   NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.24

Displaying notes found at file offset 0x00000274 with length 0x00000024:
  Owner                 Data size   Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: c266a51e4b85b16ca17bff8328f3abeafb577b29

The build-id string c266a51e4b85b16ca17bff8328f3abeafb577b29 is the output you care about. Assuming your binary has one, install the elfutils package, then use

eu-unstrip -n --core /path/to/core

to see which binaries were in use when the core dump was produced.

The output should look like this:

$ eu-unstrip -n --core /tmp/core
0x400000+0x208000 c266a51e4b85b16ca17bff8328f3abeafb577b29@0x400284 - - [exe]
0x7ffca5721000+0x1000 9c7cbcf6c957d8fc8e55b45a3c7a1556b38a3097@0x7ffca5721340 . - linux-vdso.so.1
0x7f491ad5a000+0x2241c8 d0f537904076d73f29e4a37341f8a449e2ef6cd0@0x7f491ad5a1d8 /lib64/ld-linux-x86-64.so.2 /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.19.so ld-linux-x86-64.so.2
0x7f491a995000+0x3c42c0 cf699a15caae64f50311fc4655b86dc39a479789@0x7f491a995280 /lib/x86_64-linux-gnu/libc.so.6 /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.19.so libc.so.6

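Above you can see that this core dump was in fact produced by /bin/sleep: the build-id on the [exe] line is the same one that readelf -n printed for /bin/sleep.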

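If the build-id of the executable in the core does not match your binary, you need to find the binary whose build-id does match the core before you can extract a correct crash stack trace in GDB.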
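
The comparison described above can be scripted. What follows is a minimal sketch, not a definitive tool: the function names are made up for illustration, and the parsing assumes the output layouts shown in this answer (a "Build ID:" line from readelf -n, and an eu-unstrip line whose last field is [exe]). Other elfutils versions may print the executable's line differently.

```shell
#!/bin/sh
# Sketch: check whether a binary's build ID matches the main executable
# recorded in a core file. All function names here are hypothetical.

# Read `readelf -n` output on stdin and print the first build ID found.
build_id_from_readelf() {
    sed -n 's/^[[:space:]]*Build ID:[[:space:]]*//p' | head -n 1
}

# Read `eu-unstrip -n --core` output on stdin and print the build ID of the
# module whose last field is "[exe]" (assumed layout, as in the sample above).
build_id_from_core() {
    awk '$NF == "[exe]" { split($2, id, "@"); print id[1]; exit }'
}

# Usage: check_match /path/to/binary /path/to/core
check_match() {
    bin_id=$(readelf -n "$1" | build_id_from_readelf)
    core_id=$(eu-unstrip -n --core "$2" | build_id_from_core)
    if [ -n "$bin_id" ] && [ "$bin_id" = "$core_id" ]; then
        echo "match: $bin_id"
    else
        echo "MISMATCH: binary=${bin_id:-none} core=${core_id:-none}"
        return 1
    fi
}

# Run the check only when the script is called with two arguments.
if [ "$#" -eq 2 ]; then
    check_match "$1" "$2"
fi
```

Saved under a name of your choosing and run with the binary and the core as its two arguments, it prints either the shared build ID or both IDs side by side.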

Part 2:

If the binary does match the core, then the stack is most likely simply corrupt (for example, due to a stack buffer overflow).
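
One way to sanity-check the corruption theory is to look at the "return addresses" from the backtrace as raw bytes. The sketch below (the helper name le_bytes is made up here) prints a 64-bit value in little-endian memory order, which is how it sits on an x86-64 stack. It assumes GNU coreutils (fold, tac, paste) and a full 16-digit hex value as GDB prints it.

```shell
#!/bin/sh
# Sketch: show a 16-hex-digit value from the backtrace as the 8 bytes it
# occupies in memory on a little-endian machine (lowest address first).
# le_bytes is a hypothetical helper name.
le_bytes() {
    hex=${1#0x}    # strip the leading 0x
    printf '%s\n' "$hex" | fold -w 2 | tac | paste -s -d ' ' -
}

# The three values that repeat in frames #1 to #763, then frame #764.
for word in 0x000000307ff00000 0x7ff0000000000000 0x0000000000000030 \
            0x03010102464c457f; do
    printf '%s -> %s\n' "$word" "$(le_bytes "$word")"
done
```

The last value comes out as 7f 45 4c 46 02 01 01 03, which begins with the ELF magic number ("\x7fELF"), and 0x7ff0000000000000 is the bit pattern of a double-precision +infinity. Neither looks like a code address. Together with the info frame output in the question, where each "frame" sits exactly 8 bytes above the previous one, this suggests GDB is walking through ordinary data one stack word at a time rather than through a genuine chain of 765 calls.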

"valgrind is out of the question."

Valgrind is exceptionally bad at detecting stack corruption.

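The current state of the art for debugging this kind of problem is Address Sanitizer, which is much faster and may be fast enough to run in production.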

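If the sanitized binary is not fast enough for production use, you can set it up to process some subset of the inputs in "shadow mode" (the binary runs, but its output is discarded). Any effort you put into such a setup is likely to uncover 10 or more new bugs, and will save you a great deal of future debugging effort.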
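
As a rough sketch of what an Address Sanitizer build could look like with g++: the flags below keep the optimization flags from the question and add ASan. The file names (server.cpp, server_asan) and the function names are placeholders, not anything from the original project, and the ASAN_OPTIONS values are one plausible choice rather than something this answer prescribes.

```shell
#!/bin/sh
# Sketch: build and run an ASan-instrumented binary.
# SRC and OUT are hypothetical names; override them from the environment.
SRC=${SRC:-server.cpp}
OUT=${OUT:-server_asan}

# The question's -g -O2 -march=native, plus ASan. -fno-omit-frame-pointer
# keeps ASan's stack traces readable in optimized builds.
ASAN_CXXFLAGS="-g -O2 -march=native -fsanitize=address -fno-omit-frame-pointer"

build_asan() {
    # Word splitting of $ASAN_CXXFLAGS is intentional here.
    g++ $ASAN_CXXFLAGS -o "$OUT" "$SRC"
}

run_asan() {
    # abort_on_error=1 makes ASan call abort() after printing its report;
    # disable_coredump=0 allows a core file to be written as well.
    ASAN_OPTIONS="abort_on_error=1:disable_coredump=0" "./$OUT" "$@"
}
```

After defining these in a shell, build_asan followed by run_asan with the program's usual arguments would run the instrumented binary. On a stack buffer overflow ASan should print a report naming the faulting access, which is usually more useful than a backtrace unwound from an already corrupted stack.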

c++

linux

gdb

coredump