Weird Backtrace in Perf
I used the following command to extract the backtraces leading to user-level L3 misses in a simple evince benchmark:
sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
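Clearly, the sampling period is quite large (10000 events between consecutive samples). For this experiment, the output of perf script contained some samples similar to this one: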
EvJobScheduler 27529 26441.375932: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffcd5d8ec0 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff17bec7f bits_image_fetch_separable_convolution_affine+0x2df (inlined)
7ffff17bec7f bits_image_fetch_separable_convolution_affine_pad_x8r8g8b8+0x2df (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
7ffff17d1fd1 general_composite_rect+0x301 (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
ffffffffffffffff [unknown] ([unknown])
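At the very bottom of the backtrace there is a symbol called [unknown], which seems fine. But then a line in general_composite_rect() is called. Is this backtrace OK?
AFAIK, the first caller in a backtrace should be something like _start() or __GI___clone(). But this backtrace does not have that form. What is wrong?
Is there any way to resolve the issue? Are the truncated (partial) backtraces reliable?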
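TL;DR: perf's backtracing can stop at some function if no frame pointer is saved on the stack (fp method) or if there are no CFI tables for the dwarf method. Recompile the libraries with -fno-omit-frame-pointer or with -g, or install their debuginfo. With release binaries and libraries, perf will often stop backtracing early, with no chance of reaching main(), _start, or the clone()/start_thread() top-level functions.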
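To see how often this happens in a given recording, the text output of perf script can be scanned for stacks whose outermost frame is not a thread root. The following is a minimal sketch, not part of perf: it assumes samples are separated by blank lines with the header on the first line, and the ROOT_SYMBOLS set and the helper names are illustrative choices.

```python
import re

# Outermost frames that suggest the unwinder reached the top of a thread.
# This set is an assumption for illustration; extend it for your libc/runtime.
ROOT_SYMBOLS = {"_start", "__libc_start_main", "main",
                "start_thread", "clone", "__clone", "__GI___clone"}

# One callchain line: "<hex address> <symbol+offset> (<dso>)".
FRAME_RE = re.compile(r"^\s*([0-9a-f]+)\s+(.+?)\s+\((.*)\)\s*$")


def parse_samples(text):
    """Split `perf script` text into (header, [(addr, symbol, dso), ...]) tuples."""
    samples = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        header, frames = lines[0], []
        for ln in lines[1:]:
            m = FRAME_RE.match(ln)
            if m:
                addr, sym, dso = m.groups()
                # Drop the "+0x2df" style offset so symbols compare by name.
                frames.append((addr, sym.split("+0x")[0], dso))
        samples.append((header, frames))
    return samples


def is_truncated(frames):
    """True when the outermost (last) frame is not a known thread root."""
    if not frames:
        return True
    return frames[-1][1] not in ROOT_SYMBOLS


def truncated_ratio(text):
    """Fraction of samples whose backtrace stops before a thread root."""
    samples = parse_samples(text)
    if not samples:
        return 0.0
    return sum(is_truncated(f) for _, f in samples) / len(samples)
```

Feeding it the output of perf script saved to a file gives a rough measure of how much of the profile has partial stacks; the sample shown in the question would count as truncated because its last frame is [unknown].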
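The perf profiling tool in Linux is a statistical sampling profiler (no binary instrumentation): it programs a software timer, an event source, or the hardware performance monitoring unit (PMU) to generate periodic interrupts. In your case, -c 10000 -e mem_load_uops_retired.l3_miss:uppp selects the hardware PMU on x86_64 in some kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after 10000 mem_load_uops_retired events (with the l3_miss mask). The generated interrupt is handled by the Linux kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10000 more events, and a sample is generated. The sample data is saved into the perf.data file by the perf record command, and each wakeup of the tool can save thousands of samples; the samples can be read with perf script or perf script -D.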
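Because each sample stands for one full period of events, the total number of events can be estimated from the sample headers. A small sketch of that arithmetic, assuming the header layout shown in the question ("time: period event:"); the regular expression and function name are illustrative, and other perf script field selections would need a different pattern.

```python
import re
from collections import Counter

# Matches "<seconds>.<fraction>: <period> <event>:" in a sample header line.
HEADER_RE = re.compile(r"\d+\.\d+:\s+(\d+)\s+(\S+):(?:\s|$)")


def estimate_event_counts(perf_script_text):
    """Sum the period field per event: each sample represents `period` events."""
    totals = Counter()
    for line in perf_script_text.splitlines():
        m = HEADER_RE.search(line)
        if m:
            totals[m.group(2)] += int(m.group(1))
    return dict(totals)
```

With -c 10000, two samples of mem_load_uops_retired.l3_miss:uppp therefore correspond to roughly 20000 L3-miss load events; the estimate is statistical, not an exact count.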
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the current function and has some time to retrieve additional data and record the current time, pid, and so on. Part of this process is call stack (https://en.wikipedia.org/wiki/Call_stack) data collection. But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries in Debian/Ubuntu/others), there is no default place in registers or on the function stack where the frame pointer is stored:
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and
restore frame pointers; it also makes an extra register available in
many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
targets has been changed to -fomit-frame-pointer. The default can be
reverted to -fno-omit-frame-pointer by configuring GCC with the
--enable-frame-pointer configure option.
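Backtracing/unwinding is easy when frame pointers are saved on the function stack. But for some functions, modern gcc (and other compilers) may not generate a frame pointer. So backtracing code like that in the perf_events handler will either stop at such a function or need another method of recovering the caller's frame. The --call-graph option of perf record (-g method in older perf versions) selects the method to use. It is documented in man perf-record (http://man7.org/linux/man-pages/man1/perf-record.1.html):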
--call-graph
Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI -
Call Frame Information) or "lbr" (Hardware Last Branch Record
facility) as the method to collect the information used to show the
call graphs.
In some systems, where binaries are build with gcc
--fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the
libunwind or libdw library) should be used instead. Using the "lbr"
method doesn't require any compiler options. It will produce call
graphs from the hardware LBR registers. The main limitation is that
it is only available on new Intel platforms, such as Haswell. It
can only get user call chain. It doesn't work with branch stack
sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes). User
can change the size by passing the size after comma like
"--call-graph dwarf,4096".
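So the dwarf method reuses the CFI tables to find stack frame sizes and locate the callers' stack frames. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo will probably have them. LBR will not help, because it is a rather short hardware buffer. The split processing of the dwarf method (the kernel handler saves part of the stack, and the perf user-space tool later parses it with libdw or libunwind) may lose part of the call stack, so also try increasing the size of the dwarf stack dump by passing a larger value after the comma. To my knowledge perf rejects sizes above 65528 bytes, so a value like 81920 will not be accepted; --call-graph dwarf,65528 is the largest that is.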
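Why the dump size matters can be shown with a toy model: the kernel copies only the first dump_size bytes above the sampled stack pointer, and the user-space unwinder can recover a caller only if that frame's return address lies inside the copy. The sketch below is an assumption-laden simplification (frame sizes are made up, and real unwinding also depends on CFI and registers), not how perf is implemented.

```python
def frames_within_dump(frame_sizes, dump_size=8192):
    """Count innermost frames whose return address lies inside the copied bytes.

    frame_sizes lists hypothetical frame sizes in bytes, innermost first;
    each frame is assumed to end with its return address.
    """
    offset = 0
    count = 0
    for size in frame_sizes:
        offset += size  # top of this frame, where the return address sits
        if offset > dump_size:
            break  # this frame ends beyond the dump; unwinding stops here
        count += 1
    return count


if __name__ == "__main__":
    # A 1 MiB stack-allocated array in the third frame hides all outer frames.
    big_array = [64, 128, 1 << 20, 96, 48]
    # The same call chain with the array moved to the heap.
    heap_array = [64, 128, 32, 96, 48]
    print(frames_within_dump(big_array), frames_within_dump(heap_array))
```

In this model a single large frame cuts the backtrace no matter how many small frames follow it, which is why raising the dump size helps only up to a point.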
Backtracing is implemented in the arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(), called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <- __perf_event_output <- *(event->overflow_handler) <- READ_ONCE(event->overflow_handler)(event, data, regs); in __perf_event_overflow.
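For the fp method, the user-space walk amounts to following the chain of saved frame pointers: on x86_64, [rbp] holds the caller's rbp and [rbp+8] holds the return address. The following toy model illustrates that walk over a simulated stack; it is a sketch, not the kernel code, and the addresses, the dict-based memory, and the function names are all made up for illustration.

```python
def walk_frame_pointers(memory, rbp, max_depth=127):
    """Collect return addresses by following saved rbp values up the stack."""
    chain = []
    while len(chain) < max_depth:
        if rbp not in memory or rbp + 8 not in memory:
            break  # rbp does not point at a readable frame record
        chain.append(memory[rbp + 8])  # return address into the caller
        next_rbp = memory[rbp]         # caller's saved frame pointer
        if next_rbp <= rbp:
            break  # stack grows down, so callers must sit at higher addresses
        rbp = next_rbp
    return chain


def build_demo_stack():
    """Simulated stack for the call chain _start -> main -> f -> g."""
    return {
        0x7000: 0x0,      0x7008: 0x400100,  # main's record: returns into _start
        0x6FC0: 0x7000,   0x6FC8: 0x400200,  # f's record: returns into main
        0x6F80: 0x6FC0,   0x6F88: 0x400300,  # g's record: returns into f
    }


if __name__ == "__main__":
    mem = build_demo_stack()
    # g keeps a frame pointer: the full chain is recovered.
    print([hex(a) for a in walk_frame_pointers(mem, 0x6F80)])
    # g omits its frame pointer but leaves rbp untouched: f is skipped.
    print([hex(a) for a in walk_frame_pointers(mem, 0x6FC0)])
    # g reuses rbp as a general-purpose register: the walk stops at once.
    print([hex(a) for a in walk_frame_pointers(mem, 0x1234)])
```

The second and third cases correspond to the two failure modes described above: a frame silently missing from the chain, or a backtrace that ends right after the sampled function.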
Gregg did warn about incomplete call stacks in perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I have also written up some additional links about backtracing in perf:
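I had the same problem, and it goes like this: when you collect traces with --call-graph dwarf, you get unknown entries in the stack backtrace if the stack is too large.
The default maximum stack dump size is 8 kB, but it can be increased like this: --call-graph dwarf,16578. Unfortunately, perf runs into some other problems when you increase the stack dump size. In my case, the solution was to get rid of a large stack-allocated array by allocating it on the heap instead.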