How does perf associate events to functions?
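More precisely, how does the perf tool associate PMU events to functions?
I understand that when the kernel perf subsystem records an event counter, it also records the program counter (PC), so it can attribute the counts to functions.
However, to get really fine-grained results you would need to sample the counters at a very high rate; otherwise a single counter reading ends up attributed to a whole group of functions.
But reading the counters and writing the sample data (counters, PC, call stack) to the perf mmap space is quite costly.
I have read in some sources that this sampling happens only when the PMU counter overflows, but that could be very coarse unless I set the counter to overflow very quickly.
What am I missing here?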
perf record is a statistical profiling tool. It either programs the hardware performance monitoring unit (PMU) to overflow after some number of events (for example, with -e cycles -c 1000000 it writes -1000000 to the counter and enables cycle counting; with -F, or with no frequency/period argument at all, it autotunes the period), and on each overflow interrupt perf reprograms the counter for the next period. This yields several hundred to a few thousand samples per second. Or it can use the OS timer interrupt (-e task-clock) to get periodic samples. On every sample (that is, on every interrupt from the hardware PMU or the timer) perf records the current PC (EIP) and/or the call stack; it does not record the current value of the counter. You can check this in a full dump of the data stored in perf.data with perf script or perf script -D, or in the code that dumps sample events: there is sample->ip, but no current PMU count.
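To make the overflow mechanism concrete, here is a minimal simulation in Python. It is only a sketch of the idea, not the kernel's code: the workload, the function names (hot_loop, parse, log) and the helper sample_on_overflow are all hypothetical. The counter is armed at -period, counts events upward, and each time it crosses zero the simulation records which function was running and re-arms the counter.

```python
from collections import Counter


def sample_on_overflow(trace, period):
    """Simulate period-based sampling.

    trace  -- iterable of (function_name, events_consumed) steps
    period -- number of events between two samples (like `-c period`)
    Returns the list of function names that were running at each overflow.
    """
    samples = []
    counter = -period              # armed so that it overflows after `period` events
    for func, events in trace:
        counter += events          # the hardware counts while `func` runs
        while counter >= 0:        # overflow "interrupt": record where we are
            samples.append(func)   # only the location is stored, not the count
            counter -= period      # re-arm for the next period
    return samples


# Hypothetical workload: each iteration spends 700 events in hot_loop,
# 250 in parse and 50 in log, repeated 10,000 times (10 million events).
trace = [("hot_loop", 700), ("parse", 250), ("log", 50)] * 10_000

# A period that is not a multiple of the loop length avoids sampling
# in lockstep with the workload.
samples = sample_on_overflow(trace, period=10_007)
hist = Counter(samples)
for func, hits in hist.most_common():
    print(f"{func:10s} {hits:5d} samples  {100.0 * hits / len(samples):5.1f}%")
```

With roughly a thousand samples out of ten million events, the per-function shares already come out close to the true 70/25/5 split, even though no counter value is ever stored with a sample.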
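perf report parses perf.data to get all the PCs recorded in it. It counts how many times each PC was sampled to build a histogram [PC] -> sample_count. Each PC is then attributed to the exact function it belongs to: perf report reconstructs the memory map (mmap events are also recorded in perf.data), opens each binary that was used, and looks up the PC in that binary's symbol table.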
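The two steps described above (count samples per PC, then map each PC to the symbol that contains it) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not perf's implementation: the symbol table, its addresses and the helpers resolve and build_report are made up, and real perf additionally has to translate addresses through the recorded mmap events before looking them up.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start_address, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x200, "main"),
    (0x1200, 0x800, "hot_loop"),
    (0x1A00, 0x100, "parse"),
]
STARTS = [start for start, _size, _name in SYMBOLS]


def resolve(ip):
    """Return the name of the symbol whose address range contains ip."""
    i = bisect.bisect_right(STARTS, ip) - 1   # last symbol starting at or before ip
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if ip < start + size:
            return name
    return "[unknown]"


def build_report(sampled_ips):
    """Build [PC] -> sample_count, then fold it into per-symbol totals."""
    by_ip = Counter(sampled_ips)              # histogram keyed by raw PC
    by_symbol = Counter()
    for ip, count in by_ip.items():
        by_symbol[resolve(ip)] += count
    total = sum(by_symbol.values())
    return [(name, count, 100.0 * count / total)
            for name, count in by_symbol.most_common()]


# Hypothetical sampled PCs, as they might appear in a dump of the samples.
ips = [0x1204, 0x1204, 0x1300, 0x1010, 0x1A10, 0x5000]
for name, count, percent in build_report(ips):
    print(f"{percent:6.2f}%  {count:3d}  {name}")
```

A sorted table plus binary search keeps each lookup logarithmic in the number of symbols, which matters when a profile contains many distinct PCs.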
The actual code of perf report is in linux/tools/perf/builtin-report.c: cmd_report/__cmd_report -> perf_session__process_events -> some magic -> process_sample_event, which adds every ip (PC) value found in perf.data to the histogram with hist_entry_iter__add(&iter, &al, rep->max_stack, rep); via hist_iter__report_callback:
    hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
    . . . (perf/util/annotate.c) __symbol__inc_addr_samples
    h->addr[offset]++;
It then outputs the collected histogram: report__browse_hists -> perf_evlist__tty_browse_hists -> hists__fprintf_nr_sample_events(hists, rep, evname, stdout);
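The h->addr[offset]++ line is the per-address part of the histogram: besides the per-function total, a counter is kept for each offset inside the symbol, which is what makes per-instruction annotation possible. A rough Python analogue, assuming a symbol is described only by a start address and a size (the function name inc_addr_samples and the addresses are hypothetical, not perf's API):

```python
def inc_addr_samples(addr_hist, sym_start, ip):
    """Count one sample at the offset of ip inside its symbol.

    addr_hist -- list with one counter per byte offset of the symbol
    Returns True if ip falls inside the symbol, False otherwise.
    """
    offset = ip - sym_start
    if not 0 <= offset < len(addr_hist):
        return False               # the sample does not belong to this symbol
    addr_hist[offset] += 1         # analogue of h->addr[offset]++
    return True


# Hypothetical 16-byte function starting at 0x1200.
sym_start, sym_size = 0x1200, 16
addr_hist = [0] * sym_size
for ip in (0x1204, 0x1204, 0x1208, 0x1300):
    inc_addr_samples(addr_hist, sym_start, ip)

# Offsets that received samples, e.g. for an annotated listing.
hot_offsets = {off: n for off, n in enumerate(addr_hist) if n}
print(hot_offsets)
```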
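Every sample is already attributed to an exact function (and to a slightly inexact instruction within it, because of the out-of-order nature of the CPU and imprecise PMU overflow events); this is how statistical profiling works. When your program runs for a short time (less than a second) and/or the sampling frequency is too low, only a few samples may be recorded in perf.data. But if you have more than several hundred samples, you can find the most CPU-heavy functions (they probably follow the Pareto rule and account for several tens of percent of the program's running time). When you want to look at smaller functions (around a few percent of the running time), collect thousands or tens of thousands of samples and do some statistical estimation: with 100 or 1000 samples you cannot get a correct percentage for a function that runs 0.1% of the time.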
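The "statistical estimation" can be made a little more concrete. If each sample is treated as an independent draw that lands in a given function with probability p (an idealisation; real samples are not perfectly independent), the measured share has a standard error of sqrt(p(1-p)/n). The sketch below uses the normal approximation, which is itself crude when a function has only a handful of hits; the helper names are hypothetical.

```python
import math


def share_with_error(hits, total, z=1.96):
    """Estimated share of samples and its approximate 95% half-width."""
    p = hits / total
    stderr = math.sqrt(p * (1.0 - p) / total)
    return p, z * stderr


def samples_needed(p, rel_error, z=1.96):
    """Samples needed so that the half-width is rel_error * p.

    Solves z * sqrt(p * (1 - p) / n) = rel_error * p for n.
    """
    return math.ceil(z * z * (1.0 - p) / (p * rel_error ** 2))


# A function with 1 hit out of 1000 samples: the uncertainty is larger
# than the estimate itself, so the 0.1% figure is not trustworthy.
p, half_width = share_with_error(1, 1000)
print(f"share = {100 * p:.2f}% +/- {100 * half_width:.2f}%")

# Samples required for a +/-10% relative error at different true shares.
for true_share in (0.30, 0.03, 0.001):
    print(f"{100 * true_share:5.1f}% -> {samples_needed(true_share, 0.10)} samples")
```

Under these assumptions a function taking tens of percent of the run is pinned down by under a thousand samples, while a 0.1% function needs hundreds of thousands, which matches the rule of thumb above.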