How to profile time spent in memory access in C/C++ applications?
The total time a function spends in an application can be roughly split into two components:
- time spent on actual computation (Tcomp)
- time spent on memory accesses (Tmem)
Typically a profiler gives an estimate of the total time spent in a function. Is it possible to estimate the time spent in each of the two components above (Tcomp and Tmem)?
This cannot be measured as such (and it would not make much sense to try), because computation overlaps with memory access on current processor architectures. On top of that, a memory access itself breaks down into several steps (issuing the access, prefetching into the various cache levels, and finally reading the data into processor registers).
What you can do is measure cache hit and miss rates at the various cache levels, using perf and its hardware counters (if your hardware supports them), to estimate how efficiently your algorithm runs on the hardware.
The Roofline model introduces the notion of arithmetic intensity: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Put simply, it expresses how many arithmetic instructions are executed per memory access.
Arithmetic intensity is usually computed with the help of performance counters.
If you are looking for a way to obtain the CPU cycles consumed by a function, Boost can be quite helpful. I have used the Boost Timer utility to count the CPU cycles of system calls.
On the other hand, you can apply the same approach to the whole program to obtain its total time.
Hope this is what you were looking for.
-Vijay
In his recent blog post CPU Utilization is Wrong, Brendan Gregg suggests using the instructions-per-cycle (IPC) PMC. In short: if IPC < 1.0, the application can be considered memory bound; otherwise it can be considered instruction bound. Here is the relevant excerpt from his post:
If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.

For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
Here are some examples of dummy workloads generated with the stress tool, together with their IPC.
Memory-bound test, low IPC (0.02):
$ perf stat stress --vm 4 -t 3
stress: info: [4520] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [4520] successful run completed in 3s
Performance counter stats for 'stress --vm 4 -t 3':
10767,074968 task-clock:u (msec) # 3,560 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
4 555 919 page-faults:u # 0,423 M/sec
4 290 929 426 cycles:u # 0,399 GHz
67 779 143 instructions:u # 0,02 insn per cycle
18 074 114 branches:u # 1,679 M/sec
5 398 branch-misses:u # 0,03% of all branches
3,024851934 seconds time elapsed
CPU-bound test, high IPC (1.44):
$ perf stat stress --cpu 4 -t 3
stress: info: [4465] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4465] successful run completed in 3s
Performance counter stats for 'stress --cpu 4 -t 3':
11419,683671 task-clock:u (msec) # 3,805 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
108 page-faults:u # 0,009 K/sec
30 562 187 954 cycles:u # 2,676 GHz
43 995 290 836 instructions:u # 1,44 insn per cycle
13 043 425 872 branches:u # 1142,188 M/sec
26 312 747 branch-misses:u # 0,20% of all branches
3,001218526 seconds time elapsed