Using final to reduce virtual method overhead

Question

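I came across an SO question about how to use the "final" keyword to reduce virtual method overhead (). Based on that answer, the expectation is that a derived class pointer calling overridden methods marked final will not incur the overhead of dynamic dispatch.

To benchmark the benefit of this approach, I set up a few sample classes and ran them on Quick-Bench. There are 3 cases:
Case 1: derived class pointer without the final specifier: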

Derived* f = new DerivedWithoutFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 2: base class pointer with the final specifier:

Base* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 3: derived class pointer with the final specifier:

Derived* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

The function run_multiple looks like this:

int run_multiple(int times) /* specifiers */ {
    int sum = 0;
    for(int i = 0; i < times; i++) {
        sum += run_once();
    }
    return sum;
}

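The result I observed was:
By speed: Case 2 == Case 3 > Case 1

But shouldn't Case 3 be much faster than Case 2? Is there something wrong with my experiment design or with my assumptions about the expected result?

Edit: Peter Cordes pointed to some very useful articles for further reading on this topic: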

Why can't gcc devirtualize this function call?
LTO, Devirtualization, and Virtual Tables

Answer 1

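You have correctly understood the effect of final (except perhaps for the inner loop of Case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere, because mt19937 is just plain slow and all 3 versions spend most of their time in it.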

The only thing that isn't lost or buried in noise and overhead is the effect of inlining int run_once() override final into the inner loop in run_multiple, which both Case 2 and Case 3 run.

But Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so, unlike the other 2 cases, there is function call overhead inside the inner loop.

Case 2 has to call run_multiple repeatedly through a virtual call, but that happens only once per 100 runs of run_once and has no measurable effect.
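To make the three cases concrete, here is a minimal sketch of what such a hierarchy might look like. Only the names Base, Foo, FooPlus, run_once and run_multiple come from this answer; the class bodies, the fixed seed, the distribution range, and the helpers case1/case2/case3 are assumptions made for illustration, not the code from the Quick-Bench link. It needs C++14 for std::make_unique.

```cpp
#include <memory>
#include <random>

// Hypothetical reconstruction of the benchmark classes (bodies are assumed).
struct Base {
    virtual ~Base() = default;
    virtual int run_once() = 0;
    virtual int run_multiple(int times) = 0;
};

struct Foo : Base {  // overrides WITHOUT final
    std::mt19937 rng{42u};                          // fixed seed (assumption)
    std::uniform_int_distribution<int> dist{0, 9};  // range is an assumption

    int run_once() override { return dist(rng); }
    int run_multiple(int times) override {
        int sum = 0;
        for (int i = 0; i < times; i++) {
            // A class derived from Foo could override run_once(),
            // so by the language rules this remains a virtual call.
            sum += run_once();
        }
        return sum;
    }
};

struct FooPlus : Base {  // overrides WITH final
    std::mt19937 rng{42u};
    std::uniform_int_distribution<int> dist{0, 9};

    int run_once() override final { return dist(rng); }
    int run_multiple(int times) override final {
        int sum = 0;
        for (int i = 0; i < times; i++) {
            // final: no further override can exist, so the target is
            // known statically and the call can be inlined.
            sum += run_once();
        }
        return sum;
    }
};

// Case 1: pointer to the class without final.
int case1(int times) {
    std::unique_ptr<Foo> f = std::make_unique<Foo>();
    return f->run_multiple(times);
}

// Case 2: base class pointer to the class with final.
int case2(int times) {
    std::unique_ptr<Base> f = std::make_unique<FooPlus>();
    return f->run_multiple(times);
}

// Case 3: derived class pointer to the class with final.
int case3(int times) {
    std::unique_ptr<FooPlus> f = std::make_unique<FooPlus>();
    return f->run_multiple(times);
}
```

All three helpers compute the same sum for the same seed; the cases differ only in which calls the compiler is allowed to resolve at compile time. The comments describe what the language rules permit, not what a particular compiler will actually emit.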

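For all 3 cases, most of the time is spent in dist(rng);, because std::mt19937 is pretty slow compared to the extra overhead of a function call that isn't inlined. Out-of-order execution can probably hide a lot of that overhead, too. But not all of it, so there is still something to measure.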
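One rough way to check this claim on your own machine is to time a PRNG draw against a trivial call that is not inlined. The sketch below is an assumption-laden illustration, not part of the original benchmark: cheap_call, ns_per_iteration, measure and the Timing struct are hypothetical names, and the NOINLINE macro relies on compiler-specific attributes. An optimizer may still transform these loops, so treat the numbers as indicative only.

```cpp
#include <chrono>
#include <random>

// Compiler-specific way to prevent inlining (assumption about the toolchain).
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE
#endif

// Hypothetical stand-in for a cheap function that is called, not inlined.
NOINLINE int cheap_call(int x) { return x + 1; }

struct Timing {
    double prng_ns;  // average nanoseconds per dist(rng) draw
    double call_ns;  // average nanoseconds per cheap_call()
};

// Runs body(i) `iterations` times and returns the average time per iteration.
template <class F>
double ns_per_iteration(F&& body, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        body(i);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}

Timing measure(int iterations) {
    std::mt19937 rng{42u};
    std::uniform_int_distribution<int> dist{0, 9};
    volatile int sink = 0;  // keeps the results observable to the optimizer

    Timing t;
    t.prng_ns = ns_per_iteration([&](int) { sink = sink + dist(rng); }, iterations);
    t.call_ns = ns_per_iteration([&](int i) { sink = sink + cheap_call(i); }, iterations);
    return t;
}
```

Calling measure(1000000) and comparing prng_ns with call_ns gives a rough sense of how the two costs relate on a given machine. Results depend on the compiler, flags, and CPU, so no specific ratio is claimed here.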

Case 3 is able to inline everything into this asm loop (from your quickbench link):

 # percentages are *self* time, not including time spent in the PRNG
 # These are from QuickBench's perf report tab,
 #  presumably sample for core clock cycle perf events.
 # Take them with a grain of salt: superscalar + out-of-order exec
 #  makes it hard to blame one instruction for a clock cycle

   VirtualWithFinalCase2(benchmark::State&):   # case 3 from QuickBench link
     ... setup before the loop
     .p2align 3
    .Louter:                # do{
       xor    %ebp,%ebp          # sum = 0
       mov    $0x64,%ebx         # inner = 100
     .p2align 3  #  nopw   0x0(%rax,%rax,1)
     .Linner:                    # do {
51.82% mov    %r13,%rdi
       mov    %r15,%rsi
       mov    %r13,%rdx           # copy args from call-preserved regs
       callq  404d60              # mt PRNG for unsigned long
47.27% add    %eax,%ebp           # sum += run_once()
       add    $0xffffffff,%ebx    # --inner
       jne    .Linner            # }while(inner);
       mov    %ebp,0x4(%rsp)     # store to volatile local:  benchmark::DoNotOptimize(x);
0.91%  add    $0xffffffffffffffff,%r12   # --outer
       jne    .Louter            # } while(outer)

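Case 2 can still inline run_once into run_multiple, because class FooPlus uses int run_once() override final. There is virtual dispatch overhead in the outer loop (only), but this small extra cost per outer-loop iteration is totally dwarfed by the cost of the inner loop, which is identical between Case 2 and Case 3.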

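So the inner loop is essentially identical, with indirect-call overhead only in the outer loop. It is unsurprising that this is unmeasurable on Quickbench, or at least lost in the noise.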

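Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there is function-call overhead there, too. (The fact that it is an indirect function call is relatively minor; in a tight loop, branch prediction will do a near-perfect job.)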

Case 1 and Case 2 have identical asm for their outer loop, if you look at the disassembly on your Quick-Bench link.

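Neither one can devirtualize and inline run_multiple: Case 1 because it is virtual and non-final, Case 2 because the pointer only has the base class type, not the derived class with a final override.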
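The role of the pointer's static type can be illustrated with a stripped-down sketch. The class names follow this answer, but the bodies, the extra class FooChild, and the helper functions via_base, via_foo and via_foo_plus are hypothetical and exist only to show which calls the language rules allow a compiler to resolve statically.

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int run_once() = 0;
};

struct Foo : Base {  // non-final override
    int run_once() override { return 1; }
};

struct FooChild : Foo {  // legal, because Foo::run_once is not final
    int run_once() override { return 3; }
};

struct FooPlus : Base {  // final override
    int run_once() override final { return 2; }
};

// Static type is Base: the dynamic type could be anything derived from it,
// so in general this has to be an indirect call through the vtable.
int via_base(Base* p) { return p->run_once(); }

// Static type is Foo, but run_once is not final: p may point to a FooChild,
// so the call still has to be dispatched dynamically.
int via_foo(Foo* p) { return p->run_once(); }

// Static type is FooPlus and run_once is final: no other override can exist,
// so the compiler may call FooPlus::run_once directly and inline it.
int via_foo_plus(FooPlus* p) { return p->run_once(); }
```

Passing a FooChild to via_foo returns 3 rather than 1, which is exactly why a compiler cannot assume Foo::run_once is the target when the override is not final.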

        # case 2 and case 1 *outer* loops
      .loop:                 # do {
       mov    (%r15),%rax     # load vtable pointer
       mov    $0x64,%esi      # first C++ arg
       mov    %r15,%rdi       # this pointer = hidden first arg
       callq  *0x8(%rax)      # memory-indirect call through a vtable entry
       mov    %eax,0x4(%rsp)  # store the return value to a `volatile` local
       add    $0xffffffffffffffff,%rbx
       jne    4049f0 .loop   #  } while(--i != 0);

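This is possibly a missed optimization: the compiler can prove that Base *f came from new FooPlus(), and thus is statically known to be of type FooPlus. operator new can be overridden, but the compiler still emits a separate call to FooPlus::FooPlus() (passing it a pointer to the storage from new). So this seems to be simply a case of clang not taking advantage of that knowledge in Case 2 and Case 1.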


c++

oop

benchmarking