What is the sampling rate for the intel_pt event, i.e., perf record -e intel_pt//?

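The sampling rate can be set with the -F option of the perf record command. I would like to know what the sampling rate is for the intel_pt event, i.e., for the command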

perf record -e intel_pt// -- ./a.out

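The maximum sampling rate allowed with -F in user mode is 8000. perf record probably stores a few thousand samples per second, but the trace events recorded with intel_pt occur at a much higher frequency.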

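In other words, for the intel_pt event, a trace of the application's execution is collected. Does perf record work differently when recording with the intel_pt event, i.e., in some non-sampling mode?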

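Yes, the intel_pt mode of perf record is different: it is not the same as sampling (statistical) profiling with software (cpu-clock) or hardware (cycles) events. Sampling takes about 4000 samples of the current EIP per second and gives you a basic, imprecise view of code execution. intel_pt is a hardware-based tracing technology that generates a lot of data about every control flow instruction (in the default perf intel_pt mode), which allows the full control flow to be reconstructed, but it has greater overhead. So the frequency of Intel PT events is the same as the frequency of calls, branches and returns executed per second by the program code (hundreds of millions).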

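When sampling on a hardware event, perf record asks the hardware PMU to count some event, for example CPU cycles, and to generate an overflow interrupt after, say, 2 million such events. On each such interrupt, the perf_events subsystem in the kernel records the current OS timestamp, the pid/tid of the current thread and the EIP instruction pointer into the ring buffer, and resets the PMU counter to a new value. The perf subsystem limits the maximum interrupt frequency by autotuning that value, and the -F option can be used to change the desired interrupt frequency. When the ring buffer (several megabytes in size) fills up, the perf user-space tool dumps its contents into the perf.data file, and you can view the raw data with perf script or perf script -D. Or just build a histogram with perf report, which sorts EIPs by how often the interrupt landed on each instruction address; that count is proportional to the time spent in that code. This mode produces around 4000 events per second of thread execution (perf report --header | grep sample_freq), at 48 bytes per sample, or 192 KB per second. The overhead is low, but sampling is not exact.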
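The data-rate figure in the paragraph above is simple arithmetic. A minimal sketch, assuming the roughly 4000 samples per second and 48 bytes per sample quoted in this answer (the function name is made up for illustration):

```python
def sampling_data_rate(samples_per_sec: int, bytes_per_sample: int) -> int:
    """Return bytes recorded per second of thread execution in sampling mode."""
    return samples_per_sec * bytes_per_sample


# Figures quoted in the answer: about 4000 samples/s, 48 bytes each.
rate = sampling_data_rate(4000, 48)
print(f"{rate} bytes/s = {rate / 1000:.0f} KB/s")  # 192000 bytes/s = 192 KB/s
```

Doubling the frequency to the 8000 mentioned in the question would double this to 384 KB per second, which is still tiny compared with a full trace.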

The perf wiki has a separate page for Intel Processor Trace (intel_pt) - https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace

Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you need first to create a test case that is suitable for tracing.

So intel_pt is a tracing (logging) module integrated into the CPU hardware, and when armed it generates "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may generate trace data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the article at https://lwn.net/Articles/648154/, perf_events (kernel mode) in intel_pt mode just saves the full packet log into a separate (bigger?) ring buffer, and the perf tool (user space) periodically saves data from that ring buffer into a file for offline filtering, parsing and decoding. (The period at which the aux or ring mmap is saved to the file is not the same thing as the overflow interrupt frequency set by -F.) The PT decoder is then used to reconstruct perf-compatible samples from the PT packet log. The log data volume is huge, and the overhead is 1%, 5%, 10% or more, depending on the frequency of branches in the executed code.

The documentation for intel_pt is the manpage man perf-intel-pt, and the long text is stored in the Linux kernel sources: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt

Intel PT is first supported in Intel Core M and 5th generation Intel Core processors that are based on the Intel micro-architecture code name Broadwell. Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as samples output by perf hardware events, for example as though the "instructions" or "branches" events had been recorded. Presently 3 tools support this: 'perf script', 'perf report' and 'perf inject'. ... The main distinguishing feature of Intel PT is that the decoder can determine the exact flow of software execution. Intel PT can be used to understand why and how did software get to a certain point, or behave a certain way. ... A limitation of Intel PT is that it produces huge amounts of trace data (hundreds of megabytes per second per core) which takes a long time to decode

By default, perf record -e intel_pt// is the same as -e intel_pt/tsc=1,noretcomp=0/. The "config terms" section of the man perf-intel-pt manpage says what the defaults are:

tsc Always supported. Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.

noretcomp Always supported. Disables "return compression" so a TIP packet is produced when a function returns. Causes more packets to be produced but might make decoding more reliable.

pt Specifies pass-through which enables the branch config term.

branch Enable branch tracing. Branch tracing is enabled by default

To represent software control flow, "branches" samples are produced. By default a branch sample is synthesized for every single branch.

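As it says, intel_pt in default mode is used to produce a control flow log: it asks the hardware to generate a log packet for every control flow instruction (calls, branches, returns) and adds timestamps so that the pt log can be synchronized with some auxiliary perf samples (such as exec or mmap, which are needed to find the actual code loaded into memory). It tries not to generate too much data, for example by [using a single bit (tnt) per conditional branch](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but many programs execute hundreds of millions of branches per second.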
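To get a feel for why the trace volume is so much larger than in sampling mode, here is a rough back-of-the-envelope sketch. It uses the one bit per conditional branch stated above; the 3 bytes per indirect branch and the branch rates are assumptions chosen only for illustration, and timestamps, packet framing and all other packet types are ignored.

```python
def pt_branch_payload_rate(cond_branches_per_sec: float,
                           indirect_branches_per_sec: float,
                           bytes_per_indirect: float = 3.0) -> float:
    """Rough lower bound on Intel PT branch payload, in bytes per second.

    Conditional branches cost one TNT bit each (as stated in the text);
    bytes_per_indirect is an assumed stand-in for "several bytes".
    """
    tnt_bytes = cond_branches_per_sec / 8  # 1 bit per conditional branch
    indirect_bytes = indirect_branches_per_sec * bytes_per_indirect
    return tnt_bytes + indirect_bytes


# Hypothetical workload: 200 million conditional and
# 20 million indirect branches per second.
rate = pt_branch_payload_rate(200e6, 20e6)
print(f"about {rate / 1e6:.0f} MB/s")  # about 85 MB/s
```

Even this optimistic estimate is several hundred times the 192 KB per second of the sampling mode.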

Some useful and short slides on perf + intel_pt:

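Update: although the Intel PT trace log contains the full trace (there is a packet for every branch/call/return), perf report still converts the pt log into a set of samples, as in a classic perf.data, and that sample set does have a sampling rate. This is configured with the --itrace option of perf report (iNNTT, where NN is the count and TT is the unit: i/t/ms/us/ns), as described in the man page of perf-report: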

   --itrace
       Options for decoding instruction tracing data. The options are:
           i       synthesize instructions events
           g       synthesize a call chain (use with i or x)
           The default is all events i.e. the same as --itrace=ibxwpe,
           In addition, the period (default 100000, ...)
           for instructions events can be specified in units of:

           i       instructions
           t       ticks
           ms      milliseconds
           us      microseconds
           ns      nanoseconds (default)

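So it seems that by default perf report converts the full trace log into instruction samples with a period of 100000 in the default unit, which the excerpt above lists as nanoseconds, i.e. one perf sample is synthesized per 100 us of trace (the perf-intel-pt text quoted further down also gives 100us as the default period for Intel PT). The rate can be raised by choosing a smaller period, but the processing time will increase.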
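The iNNTT notation can be unpacked mechanically. The following is a minimal sketch of such a parser, written only to illustrate the notation: it is not perf's own code, the function names are made up, and it handles nothing but the period attached to the i letter.

```python
import re

# Time units listed in the man page excerpt; "ns" is the default unit there.
_NS_PER_UNIT = {"ms": 1_000_000, "us": 1_000, "ns": 1}


def parse_instruction_period(spec: str):
    """Parse the period attached to 'i', e.g. 'i10us' -> (10, 'us')."""
    m = re.fullmatch(r"i(\d+)(ms|us|ns|i|t)?", spec)
    if m is None:
        raise ValueError(f"unsupported spec: {spec!r}")
    return int(m.group(1)), m.group(2) or "ns"


def period_in_ns(spec: str):
    """Return the period in nanoseconds, or None for non-time units (i, t)."""
    count, unit = parse_instruction_period(spec)
    factor = _NS_PER_UNIT.get(unit)
    return None if factor is None else count * factor


print(parse_instruction_period("i10us"))  # (10, 'us')
print(period_in_ns("i100000"))            # 100000
```

A spec such as i1i (one sample per instruction) has no fixed duration, which is why the second helper returns None for the instruction and tick units.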

The manpage of perf-intel-pt gives more examples of --itrace option usage:

   Because samples are synthesized after-the-fact, the sampling period
   can be selected for reporting. e.g. sample every microsecond

       sudo perf report pt_ls --itrace=i1usge

   See the sections below for more information about the --itrace
   option.

   Beware the smaller the period, the more samples that are produced,
   and the longer it takes to process them.

   Also note that the coarseness of Intel PT timing information will
   start to distort the statistical value of the sampling as the
   sampling period becomes smaller.

   To see every possible IPC value, "instructions" events can be used
   e.g. --itrace=i0ns


       --itrace=i10us

   sets the period to 10us i.e. one instruction sample is synthesized
   for each 10 microseconds of trace. Alternatives to "us" are "ms"
   (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
   (instructions).

   For Intel PT, the default period is 100us.


   Setting it to a zero period means "as often as possible".

   In the case of Intel PT that is the same as a period of 1 and a unit
   of instructions (i.e. --itrace=i1i).
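The warning about small periods is easy to quantify. A small sketch, assuming a time-based period and a continuous trace of a single thread (the helper name is hypothetical, and real traces may have gaps):

```python
def synthesized_samples(trace_seconds: float, period_us: float) -> int:
    """Approximate number of instruction samples for a time-based period."""
    if period_us <= 0:
        raise ValueError("a zero period means one sample per instruction")
    return round(trace_seconds * 1_000_000 / period_us)


# One second of trace at the default 100us period and at two smaller periods.
for period_us in (100, 10, 1):
    print(f"{period_us:>3} us -> {synthesized_samples(1.0, period_us)} samples")
```

Going from the default 100us to 1us multiplies the number of samples, and with it the decoding work, by 100.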

http://halobates.de/blog/p/410 has some additional, more complex conversion examples:

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by default “samples” the data (only dumps a sample every 100us). This can be configured using the --itrace option (see reference below)

 perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

 perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us