难以理解和比较 CPU 性能指标

Question

当运行ning toplev, from pmu-tools 在一个软件上（用 gcc 编译：gcc -g -O3）我得到这个输出：

FE             Frontend_Bound:                                          37.21 +-     0.00 % Slots                
BAD            Bad_Speculation:                                         23.62 +-     0.00 % Slots                
BE             Backend_Bound:                                            7.33 +-     0.00 % Slots below          
RET            Retiring:                                                31.82 +-     0.00 % Slots below          
FE             Frontend_Bound.Frontend_Latency:                         26.55 +-     0.00 % Slots                
FE             Frontend_Bound.Frontend_Bandwidth:                       10.62 +-     0.00 % Slots                
BAD            Bad_Speculation.Branch_Mispredicts:                      23.72 +-     0.00 % Slots                
BAD            Bad_Speculation.Machine_Clears:                           0.01 +-     0.00 % Slots below          
BE/Mem         Backend_Bound.Memory_Bound:                               1.59 +-     0.00 % Slots below          
BE/Core        Backend_Bound.Core_Bound:                                 5.73 +-     0.00 % Slots below          
RET            Retiring.Base:                                           31.54 +-     0.00 % Slots below          
RET            Retiring.Microcode_Sequencer:                             0.28 +-     0.00 % Slots below          
FE             Frontend_Bound.Frontend_Latency.ICache_Misses:            0.70 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.ITLB_Misses:              0.62 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:          5.04 +-     0.00 % Clocks_Estimated      <==
FE             Frontend_Bound.Frontend_Latency.DSB_Switches:             0.57 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.LCP:                      0.00 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.MS_Switches:              0.76 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Bandwidth.MITE:                   0.36 +-     0.00 % CoreClocks below     
FE             Frontend_Bound.Frontend_Bandwidth.DSB:                   26.79 +-     0.00 % CoreClocks below     
FE             Frontend_Bound.Frontend_Bandwidth.LSD:                    0.00 +-     0.00 % CoreClocks below     
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                      6.53 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound:                     -0.03 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound:                      0.37 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound:                    2.46 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                   0.22 +-     0.00 % Stalls below         
BE/Core        Backend_Bound.Core_Bound.Divider:                         0.01 +-     0.00 % Clocks below         
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:              28.53 +-     0.00 % Clocks below         
RET            Retiring.Base.FP_Arith:                                   0.02 +-     0.00 % Uops below           
RET            Retiring.Base.Other:                                     99.98 +-     0.00 % Uops below           
RET            Retiring.Microcode_Sequencer.Assists:                     0.00 +-     0.00 % Slots_Estimated below
               MUX:                                                    100.00 +-     0.00 %                      
warning: 6 results not referenced: 67 71 72 85 87 88

此二进制文件需要大约 4.7 秒才能运行。

如果我将以下标志添加到 gcc：-falign-loops=32，二进制文件现在需要大约 3.8 秒才能达到运行，这是 toplev 的输出：

FE             Frontend_Bound:                                          17.47 +-     0.00 % Slots below           
BAD            Bad_Speculation:                                         28.55 +-     0.00 % Slots                 
BE             Backend_Bound:                                           12.02 +-     0.00 % Slots                 
RET            Retiring:                                                34.21 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Latency:                          6.10 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Bandwidth:                       11.31 +-     0.00 % Slots below           
BAD            Bad_Speculation.Branch_Mispredicts:                      29.19 +-     0.00 % Slots                  <==
BAD            Bad_Speculation.Machine_Clears:                           0.01 +-     0.00 % Slots below           
BE/Mem         Backend_Bound.Memory_Bound:                               4.58 +-     0.00 % Slots below           
BE/Core        Backend_Bound.Core_Bound:                                 7.44 +-     0.00 % Slots below           
RET            Retiring.Base:                                           33.70 +-     0.00 % Slots below           
RET            Retiring.Microcode_Sequencer:                             0.50 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Latency.ICache_Misses:            0.55 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.ITLB_Misses:              0.58 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:          5.72 +-     0.00 % Clocks_Estimated below
FE             Frontend_Bound.Frontend_Latency.DSB_Switches:             0.17 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.LCP:                      0.00 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.MS_Switches:              0.40 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Bandwidth.MITE:                   0.68 +-     0.00 % CoreClocks below      
FE             Frontend_Bound.Frontend_Bandwidth.DSB:                   42.01 +-     0.00 % CoreClocks below      
FE             Frontend_Bound.Frontend_Bandwidth.LSD:                    0.00 +-     0.00 % CoreClocks below      
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                      7.60 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound:                     -0.04 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound:                      0.70 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound:                    0.71 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                   1.85 +-     0.00 % Stalls below          
BE/Core        Backend_Bound.Core_Bound.Divider:                         0.02 +-     0.00 % Clocks below          
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:              17.38 +-     0.00 % Clocks below          
RET            Retiring.Base.FP_Arith:                                   0.02 +-     0.00 % Uops below            
RET            Retiring.Base.Other:                                     99.98 +-     0.00 % Uops below            
RET            Retiring.Microcode_Sequencer.Assists:                     0.00 +-     0.00 % Slots_Estimated below 
               MUX:                                                    100.00 +-     0.00 %                       
warning: 6 results not referenced: 67 71 72 85 87 88

通过添加该标志，前端延迟得到改善（正如我们从 toplev 输出中看到的那样）。我知道通过添加该标志，现在循环对齐到 32 字节，并且当运行紧循环时 DSB 被更频繁地命中（代码大部分时间花在几个小循环中）。但是我不明白为什么指标 Frontend_Bound.Frontend_Bandwidth.DSB 上升了（该指标的描述是：“该指标代表 CPU 周期的核心分数可能由于 DSB（解码的 uop 缓存）获取而受到限制 pipeline")。我原以为该指标会下降，因为 DSB 的使用正是我通过添加 gcc 标志而改进的地方。

PS：当运行ning toplev 我使用 --no-multiplex 来最小化由多路复用引起的错误。目标架构是 Broadwell，循环的汇编如下（Intel 语法）：

 606:   eb 15                   jmp    61d <main+0x7d>
 608:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
 60f:   00 
 610:   48 83 c6 01             add    rsi,0x1
 614:   48 81 fe 01 20 00 00    cmp    rsi,0x2001
 61b:   74 ad                   je     5ca <main+0x2a>
 61d:   41 80 3c 30 00          cmp    BYTE PTR [r8+rsi*1],0x0
 622:   74 ec                   je     610 <main+0x70>
 624:   48 8d 0c 36             lea    rcx,[rsi+rsi*1]
 628:   48 81 f9 00 20 00 00    cmp    rcx,0x2000
 62f:   77 20                   ja     651 <main+0xb1>
 631:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
 636:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 63d:   00 00 00 
 640:   41 c6 04 08 00          mov    BYTE PTR [r8+rcx*1],0x0
 645:   48 01 f1                add    rcx,rsi
 648:   48 81 f9 00 20 00 00    cmp    rcx,0x2000
 64f:   7e ef                   jle    640 <main+0xa0>

Answer 1

您的汇编代码揭示了为什么带宽 DSB 指标非常高（即，在 DSB 处于活动状态的所有核心周期的 42.01% 中，DSB 提供的速度小于 4 微指令）。该问题似乎存在于以下循环中：

 610:   48 83 c6 01             add    rsi,0x1
 614:   48 81 fe 01 20 00 00    cmp    rsi,0x2001
 61b:   74 ad                   je     5ca <main+0x2a>
 61d:   41 80 3c 30 00          cmp    BYTE PTR [r8+rsi*1],0x0
 622:   74 ec                   je     610 <main+0x70>

尽管将 -falign-loops=32 传递给编译器，但此循环在 16 字节边界上对齐。此外，最后一条指令跨越 32 字节边界，这意味着它将存储在 DSB 中的不同缓存集中。 DSB 只能在同一周期内从一个集合向 IDQ 传送微指令。因此它将在一个周期中交付 add 和 cmp/je，在下一个周期中交付第二个 cmp/je。在这两个周期中，DSB 带宽都小于 4 微码。

然而，LSD 应该隐藏这些限制。但是好像不活跃。该循环包含两个跳转指令。第一个似乎检查是否已达到数组的大小（0x2001 字节），第二个似乎检查是否已达到非零字节宽的元素。最大行程计数 0x2001 为 LSD 提供了充足的时间来检测环路并将其锁定在 IDQ 中。另一方面，如果在 LSD 检测到循环之前找到非零元素的概率，则微指令将从 DSB 路径或 MITE 路径传递。在这种情况下，它们似乎是从 DSB 路径传送的。并且由于循环体跨越 32 字节边界，执行一次迭代需要 2 个周期（相比之下，如果循环是 32 字节对齐的，则最多需要一个周期，因为 Broadwell 上有两个跳转执行端口）。我认为如果将此循环对齐到 32 字节，带宽 DSB 指标将会提高，不是因为 DSB 将在每个周期提供 4 微指令（每个周期仅提供 3 微指令），而是因为它可能需要更少的周期来执行循环。

即使您以某种方式更改了代码，以便从 LSD 传递微指令，您仍然不能在每次迭代中完成 1 个周期，尽管 Broadwell 中的 LSD 可以跨循环迭代传递微指令（与我认为的 DSB 形成对比）。那是因为你会遇到另一个瓶颈：一个周期内最多可以分配两次跳转（参见：）。因此，带宽 LSD 指标会变大，而带宽 DSB 指标会变小。这只是改变了瓶颈，并没有提高性能（虽然可能会提高功耗）。除了将工作从某个地方转移到循环之外，没有其他方法可以提高此循环的前端带宽。

有关 LSD 的信息，请参阅。

难以理解和比较 CPU 性能指标

Trouble understanding and comparing CPU performance metrics

performance

x86

intel

compiler-optimization

perf