KCachegrind output for optimized vs unoptimized builds

Question

I run valgrind --tool=callgrind ./executable on the executable built from the following code:

#include <cstdlib>
#include <stdio.h>
using namespace std;

class XYZ{
public:
    int Count() const {return count;}
    void Count(int val){count = val;}
private:
    int count;
};

int main() {
    XYZ xyz;
    xyz.Count(10000);
    int sum = 0;
    for(int i = 0; i < xyz.Count(); i++){
//My interest is to see how the compiler optimizes the xyz.Count() call
        sum += i;
    }
    printf("Sum is %d\n", sum);
    return 0;
}

I build the debug version with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -std=c++14. The release build uses: -fPIC -fno-strict-aliasing -fexceptions -g -O2 -std=c++14.

Running valgrind generates two dump files, one for the debug executable and one for the release executable. When these files are viewed in KCachegrind, the debug build is understandable, as shown below:

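As expected, the function XYZ::Count() const is called 10001 times. However, the optimized release build is much harder to decipher, and it is not at all clear how many times the function is called. I am aware that the function call may be inlined. But how does one figure out that it has in fact been inlined? The call graph for the release build is shown below: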

There seems to be no indication of the function XYZ::Count() const being called from main() at all.

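My questions are:

(1) Without looking at the assembly code generated by the debug/release builds, and using only KCachegrind, how can one figure out how many times a particular function (in this case XYZ::Count() const) is called? In the release build call graph above, the function is not shown as called even once.

(2) Is there a way to understand the call graph and other details that KCachegrind provides for release/optimized builds? I have already looked at the KCachegrind manual at https://docs.kde.org/trunk5/en/kdesdk/kcachegrind/kcachegrind.pdf, but I was wondering whether there are some useful hacks or rules of thumb that one should look for in release builds.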

Answer 1

Search the callgrind.out file for XYZ::Count() to see whether valgrind recorded any events for this function.

grep "XYZ::Count()" callgrind.out | more

If you find the function name in the callgrind file, it is important to know that kcachegrind hides functions with a small weight. See the answer: Make callgrind show all function calls in the kcachegrind callgraph

Answer 2

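The output of valgrind is easy to understand: as valgrind+kcachegrind are telling you, this function was not called at all in the release build.

The question is, what do you mean by "called"? If a function is inlined, is it still "called"? Actually, the situation is more complex than it seems at first sight, and your example isn't that trivial.

Was Count() inlined in the release build? Sure, kind of. The code transformation during optimization is often quite remarkable, as in your case - and the best way to judge is to look at the resulting assembler (here for clang):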

main:                                   # @main
        pushq   %rax
        leaq    .L.str(%rip), %rdi
        movl    $49995000, %esi         # imm = 0x2FADCF8
        xorl    %eax, %eax
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rcx
        retq
.L.str:
        .asciz  "Sum is %d\n"

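You can see that main doesn't execute the for-loop at all; it just prints the result (49995000), which was calculated during optimization because the number of iterations is known at compile time.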
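The effect is similar in spirit to requesting compile-time evaluation explicitly. The following is only a minimal sketch, assuming a C++14 compiler; sum_below is a hypothetical helper that is not part of the original program, and the optimizer's constant folding is a different mechanism from constexpr evaluation:

```cpp
// Hypothetical helper (not in the original program): the same loop as in
// main(), written so that the compiler is allowed to evaluate it itself.
constexpr int sum_below(int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// With the iteration count known at compile time, the whole loop collapses
// into the constant that clang loads into %esi in the listing above.
static_assert(sum_below(10000) == 49995000,
              "the loop result is a compile-time constant");
```

In both cases no loop needs to exist at run time, so a profiler has nothing to count.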

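So was Count() inlined? Yes, somewhere during the first steps of optimization, but then the code became something completely different - there is no place in the final assembler where Count() was inlined.

So what happens when we "hide" the number of iterations from the compiler? E.g. by passing it via the command line: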

...
int main(int argc,  char* argv[]) {
   XYZ xyz;
   xyz.Count(atoi(argv[1]));
...

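In the resulting assembler, we still don't encounter a for-loop, because the optimizer can figure out that the call of Count() has no side effects and optimizes the whole thing away: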

main:                                   # @main
        pushq   %rbx
        movq    8(%rsi), %rdi
        xorl    %ebx, %ebx
        xorl    %esi, %esi
        movl    $10, %edx
        callq   strtol@PLT
        testl   %eax, %eax
        jle     .LBB0_2
        leal    -1(%rax), %ecx
        leal    -2(%rax), %edx
        imulq   %rcx, %rdx
        shrq    %rdx
        leal    -1(%rax,%rdx), %ebx
.LBB0_2:
        leaq    .L.str(%rip), %rdi
        xorl    %eax, %eax
        movl    %ebx, %esi
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rbx
        retq
.L.str:
        .asciz  "Sum is %d\n"

The optimizer came up with the formula (n-1)*(n-2)/2 + (n-1), which equals n*(n-1)/2, for the sum i=0..n-1!
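The arithmetic in the listing can be checked by hand or with a few lines of code. This is a sketch only: the function names are made up for illustration, and 64-bit integers are used to sidestep overflow, which the 32-bit assembly does not do.

```cpp
#include <cstdint>

// What the source loop computes: 0 + 1 + ... + (n-1).
std::int64_t loop_sum(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// What the clang listing computes, instruction by instruction:
//   leal -1(%rax), %ecx        -> n-1
//   leal -2(%rax), %edx        -> n-2
//   imulq %rcx, %rdx ; shrq    -> (n-1)*(n-2)/2
//   leal -1(%rax,%rdx), %ebx   -> (n-1) + (n-1)*(n-2)/2
// The jle branch skips all of this and leaves 0 when n <= 0.
std::int64_t asm_closed_form(std::int64_t n) {
    if (n <= 0) {
        return 0;
    }
    return (n - 1) * (n - 2) / 2 + (n - 1);
}

// The textbook closed form for the same sum.
std::int64_t textbook_form(std::int64_t n) {
    return n <= 0 ? 0 : n * (n - 1) / 2;
}
```

For n = 10000 all three give 49995000, the constant seen in the first listing.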

Now let's hide the definition of Count() in a separate translation unit, class.cpp, so the optimizer cannot see its definition:

class XYZ{
public:
    int Count() const;//definition in separate translation unit
...
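For readers who want to reproduce this step, the split might look roughly as follows. This is a sketch under assumptions: the header name class.h is hypothetical (the answer only names class.cpp), and the setter is left inline because the answer does not say otherwise.

```cpp
// --- class.h (hypothetical name): included by both test.cpp and class.cpp ---
class XYZ {
public:
    int Count() const;                   // declaration only; body lives in class.cpp
    void Count(int val) { count = val; } // unchanged from the question
private:
    int count;
};

// --- class.cpp: compiled separately, so test.cpp never sees this body ---
int XYZ::Count() const { return count; }
```

Without link-time optimization, the compiler working on test.cpp has to emit a real call for every loop iteration, because it cannot prove that Count() returns the same value each time.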

Now we get our for-loop and a call to Count() in every iteration; the most important part of the assembler is:

.L6:
        addl    %ebx, %ebp
        addl    $1, %ebx
.L3:
        movq    %r12, %rdi
        call    XYZ::Count() const@PLT
        cmpl    %eax, %ebx
        jl      .L6

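In every iteration step, the result of Count() (in %eax) is compared with the current counter (in %ebx). Now, if we run it with valgrind, we can see in the list of callees that XYZ::Count() was called 10001 times.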
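A different way to keep Count() out of line, which this answer does not use, is a compiler-specific attribute. The sketch below assumes GCC or clang; sum_below_count is a hypothetical wrapper around the loop from the question. The attribute only forbids inlining: the optimizer may still decide to call the function fewer times than the source suggests, so the call count reported by callgrind is not guaranteed to be 10001.

```cpp
class XYZ {
public:
    // GCC/clang extension (assumption about the toolchain): never inline
    // this function, so it stays a separate symbol in the optimized binary.
    __attribute__((noinline)) int Count() const { return count; }
    void Count(int val) { count = val; }
private:
    int count = 0;
};

// Hypothetical wrapper containing the loop from the question's main().
int sum_below_count(const XYZ& xyz) {
    int sum = 0;
    for (int i = 0; i < xyz.Count(); i++) {
        sum += i;
    }
    return sum;
}
```

Built with -O2 and run under callgrind, such a binary at least gives XYZ::Count() const a chance to show up as a callee, at the price of profiling code that differs from the real release build.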

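However, for modern tool-chains it is not enough to look at the assembler of a single translation unit - there is a thing called link-time optimization. We can use it by building somewhere along these lines: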

gcc -fPIC -g -O2 -flto -o class.o -c class.cpp
gcc -fPIC -g -O2 -flto -o test.o  -c test.cpp
gcc -g -O2 -flto -o test_r class.o test.o

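Running the resulting executable with valgrind, we once again see that Count() is not called!

However, let's look at the machine code (here I used gcc; my clang installation seems to have an issue with lto):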

00000000004004a0 <main>:
  4004a0:   48 83 ec 08             sub    $0x8,%rsp
  4004a4:   48 8b 7e 08             mov    0x8(%rsi),%rdi
  4004a8:   ba 0a 00 00 00          mov    $0xa,%edx
  4004ad:   31 f6                   xor    %esi,%esi
  4004af:   e8 bc ff ff ff          callq  400470 <strtol@plt>
  4004b4:   85 c0                   test   %eax,%eax
  4004b6:   7e 2b                   jle    4004e3 <main+0x43>
  4004b8:   89 c1                   mov    %eax,%ecx
  4004ba:   31 d2                   xor    %edx,%edx
  4004bc:   31 c0                   xor    %eax,%eax
  4004be:   66 90                   xchg   %ax,%ax
  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>
  4004c9:   48 8d 35 a4 01 00 00    lea    0x1a4(%rip),%rsi        # 400674 <_IO_stdin_used+0x4>
  4004d0:   bf 01 00 00 00          mov    $0x1,%edi
  4004d5:   31 c0                   xor    %eax,%eax
  4004d7:   e8 a4 ff ff ff          callq  400480 <__printf_chk@plt>
  4004dc:   31 c0                   xor    %eax,%eax
  4004de:   48 83 c4 08             add    $0x8,%rsp
  4004e2:   c3                      retq   
  4004e3:   31 d2                   xor    %edx,%edx
  4004e5:   eb e2                   jmp    4004c9 <main+0x29>
  4004e7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)

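We can see that the call to the function Count() was inlined, but there is still a for-loop (I guess this is a gcc vs clang thing).

But what is of most interest to you: the function Count() is "called" only once - its value is saved to the register %ecx, and the loop is actually only: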

  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>
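Read back into C++, the gcc listing corresponds roughly to the function below. This is a hand-written approximation for illustration, not compiler output; sum_like_lto_build is a made-up name, and n stands for the single value of Count() that gcc keeps in %ecx.

```cpp
// n plays the role of %ecx: the one and only read of xyz.Count().
int sum_like_lto_build(int n) {
    int sum = 0;           // %edx
    if (n <= 0) {          // test %eax,%eax ; jle 4004e3
        return sum;
    }
    int i = 0;             // %eax
    do {
        sum += i;          // add %eax,%edx
        i += 1;            // add $0x1,%eax
    } while (i != n);      // cmp %ecx,%eax ; jne 4004c0
    return sum;
}
```

The loop condition no longer involves any function call, which is why callgrind reports no calls to Count() for this build.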

You can also see all of this with the help of KCachegrind, if valgrind was run with the option --dump-instr=yes.
