How many 1-byte NOPs can Skylake execute in one cycle

Question

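I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs, up to 15 of them. How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs is worth the price. It is not me who adds these NOPs; the assembler adds them automatically whenever I write an align directive.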

Update: I have managed to get the assembler to insert multi-byte NOPs automatically.

Answer 1

"It is not me who is adding these NOPs but an assembler. It is pretty dumb and does not support options (BASM) for alignment: there is just one option, the boundary size."

I don't know what "BASM" is, and I can't find any reference to it online (except this, which obviously isn't x86), but if it doesn't support multi-byte NOPs, you really need a different assembler. This is just really basic stuff that's been in the Intel and AMD architecture manuals for years. The Gnu assembler can do this for ALIGN directives, and so can Microsoft's MASM. The open-source NASM and YASM assemblers support this as well, and either of them can be integrated into any existing build system easily.

By multi-byte NOPs, I mean the following, which you can find in the AMD and Intel processor manuals:

Length   |  Mnemonic                                 |  Opcode Bytes
---------|-------------------------------------------|-------------------------------------
1 byte   |  NOP                                      |  90
2 bytes  |  66 NOP                                   |  66 90
3 bytes  |  NOP DWORD [EAX]                          |  0F 1F 00
4 bytes  |  NOP DWORD [EAX + 00H]                    |  0F 1F 40 00
5 bytes  |  NOP DWORD [EAX + EAX*1 + 00H]            |  0F 1F 44 00 00
6 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00H]         |  66 0F 1F 44 00 00
7 bytes  |  NOP DWORD [EAX + 00000000H]              |  0F 1F 80 00 00 00 00
8 bytes  |  NOP DWORD [EAX + EAX*1 + 00000000H]      |  0F 1F 84 00 00 00 00 00
9 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00000000H]   |  66 0F 1F 84 00 00 00 00 00

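The sequence recommendations offered by the two manufacturers diverge slightly after 9 bytes, but NOPs that long are... not very common. It probably doesn't matter much anyway, since extremely long NOP instructions with an excessive number of prefixes degrade performance. These encodings work all the way back to the Pentium Pro, so they are universally supported today.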

Agner Fog has this to say about multi-byte NOPs:

The multi-byte NOP instruction has the opcode 0F 1F + a dummy memory operand. The length of the multi-byte NOP instruction can be adjusted by optionally adding 1 or 4 bytes of displacement and a SIB byte to the dummy memory operand and by adding one or more 66H prefixes. An excessive number of prefixes can cause delay on older microprocessors, but at least two prefixes is acceptable on most processors. NOPs of any length up to 10 bytes can be constructed in this way with no more than two prefixes. If the processor can handle multiple prefixes without penalty then the length can be up to 15 bytes.

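All of these redundant/superfluous prefixes are simply ignored. The advantage, of course, is that a multi-byte NOP decodes as a single instruction, so on many newer processors it costs far less decode and issue bandwidth than the run of 1-byte NOP (0x90) instructions needed to fill the same space.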
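To make the table above concrete, here is a minimal Python sketch of how a padding gap could be filled with as few NOP instructions as possible. The names `NOPS` and `nop_padding` are made up for illustration, and the greedy longest-first choice is an assumption; a real assembler may split the gap differently.

```python
# Multi-byte NOP encodings copied from the table above, keyed by length in bytes.
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def nop_padding(n):
    """Return a list of NOP encodings totalling n bytes, longest first."""
    if n < 0:
        raise ValueError("padding length must be non-negative")
    chunks = []
    while n > 0:
        size = min(n, 9)          # 9 bytes is the longest entry in the table
        chunks.append(NOPS[size])
        n -= size
    return chunks


# 15 bytes of padding become two instructions instead of fifteen.
print([len(c) for c in nop_padding(15)])   # [9, 6]
```

With this scheme the worst case for 16-byte alignment is two executed NOPs rather than fifteen.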

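Perhaps even better than multi-byte NOPs for alignment is using longer forms of the instructions that you're already using in your code. These lengthier encodings don't take any longer to execute (they affect only decode bandwidth), so they're faster/cheaper than NOPs. Examples of this are: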

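Using the mod-reg-r/m byte forms of instructions like INC, DEC, PUSH, POP, etc., instead of the short versions
Using an equivalent instruction that is longer, like ADD instead of INC or LEA instead of MOV.
Encoding longer forms of immediate operands (e.g., 32-bit immediates instead of sign-extended 8-bit immediates)
Adding SIB bytes and/or unnecessary prefixes (e.g., operand-size, segment, and REX in long mode)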

Agner Fog's manuals cover these techniques at length and give examples.
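As a rough illustration of the list above, this Python sketch tabulates several ways to encode "add 1 to EAX" in 32-bit mode and picks the one that absorbs a given number of padding bytes. `ENCODINGS` and `pad_by` are hypothetical names. Two caveats: the one-byte 40 opcode is INC EAX only in 32-bit mode (in 64-bit mode it is a REX prefix), and ADD updates the carry flag while INC does not, so the swap is only safe when nothing reads CF afterwards.

```python
# Ways to encode "add 1 to EAX" in 32-bit mode, shortest to longest.
ENCODINGS = [
    ("inc eax (short form)",                  bytes.fromhex("40")),
    ("inc eax (mod-reg-r/m form)",            bytes.fromhex("ffc0")),
    ("add eax, 1 (imm8)",                     bytes.fromhex("83c001")),
    ("add eax, 1 (imm32, short EAX form)",    bytes.fromhex("0501000000")),
    ("add eax, 1 (imm32, mod-reg-r/m form)",  bytes.fromhex("81c001000000")),
]


def pad_by(extra):
    """Return (name, bytes) of the encoding that is `extra` bytes longer
    than the shortest one, or None if no listed encoding has that length."""
    base = len(ENCODINGS[0][1])
    for name, code in ENCODINGS:
        if len(code) - base == extra:
            return name, code
    return None


for name, code in ENCODINGS:
    print(f"{len(code)} bytes  {code.hex(' ')}  {name}")
```

A gap that no single encoding covers (3 extra bytes here) would have to be spread over more than one instruction.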

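I'm not aware of any assembler that will do these conversions/optimizations for you automatically (assemblers pick the shortest version, for obvious reasons), but they usually have a strict mode where you can force a particular encoding to be used, or you can just emit the instruction bytes manually. You would only do this in highly performance-sensitive code anyway, where the work actually pays off, so that substantially limits the scope of the effort required.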

"I want to know whether the extra code size and extra execution time of adding these NOPs is worth the price."

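In general, no. While data alignment is extremely important and essentially free (size of the binary notwithstanding), code alignment is a lot less important. There are cases in tight loops where it can make a significant difference, but this only matters in the hot spots of your code, which your profiler will already have identified, and then you can align the code manually if necessary. Otherwise, I wouldn't worry about it.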

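It makes sense to align functions, as the padding bytes between them are never executed (rather than NOPs here, you'll often see INT 3 or an invalid instruction, like UD2), but I wouldn't go around aligning all of your branch targets within functions simply as a matter of course. Only do it in known critical inner loops.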

As ever, Agner Fog talks about this, and says it better than I could:

Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context.

A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries.

In most cases, the effect of code alignment is minimal. So my recommendation is to align code only in the most critical cases like critical subroutines and critical innermost loops.

Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired. The assembler does this with the ALIGN directive. The NOP's that are inserted will not slow down the performance because they are never executed.

It is more problematic to align a loop entry because the preceding code is also executed. It may require up to 15 NOP's to align a loop entry by 16. These NOP's will be executed before the loop is entered and this will cost processor time. It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's. The best modern assemblers will do just that and use instructions like MOV EAX,EAX and LEA EBX,[EBX+00000000H] to fill the space before an ALIGN nn statement. The LEA instruction is particularly flexible. It is possible to give an instruction like LEA EBX,[EBX] any length from 2 to 8 by variously adding a SIB byte, a segment prefix and an offset of one or four bytes of zero. Don't use a two-byte offset in 32-bit mode as this will slow down decoding. And don't use more than one prefix because this will slow down decoding on older Intel processors.

Using pseudo-NOPs such as MOV RAX,RAX and LEA RBX,[RBX+0] as fillers has the disadvantage that it has a false dependence on the register, and it uses execution resources. It is better to use the multi-byte NOP instruction which can be adjusted to the desired length. The multi-byte NOP instruction is available in all processors that support conditional move instructions, i.e. Intel PPro, P2, AMD Athlon, K7 and later.

An alternative way of aligning a loop entry is to code the preceding instructions in ways that are longer than necessary. In most cases, this will not add to the execution time, but possibly to the instruction fetch time.

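He also goes on to show an example of another way to align an inner loop, by moving the preceding subroutine entry. This is kind of awkward, and requires some manual adjustment even in the best of assemblers, but it may be the most optimal mechanism. Again, it only matters in critical inner loops on the hot path, where you're probably already digging in and micro-optimizing anyway.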
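The fetch-block arithmetic in the quoted passage is easy to check numerically. This is a small Python sketch under the quote's own assumption of aligned 16-byte fetch blocks; the function names and the example addresses are invented for illustration.

```python
def useful_bytes(addr, block=16):
    """Bytes available from addr to the end of its aligned fetch block."""
    return block - (addr % block)


def padding_needed(addr, align=16):
    """Bytes of filler required to move addr up to the next aligned address."""
    return (-addr) % align


for addr in (0x400, 0x40D, 0x401):
    print(hex(addr), useful_bytes(addr), padding_needed(addr))
# 0x400 16 0
# 0x40d 3 3
# 0x401 15 15
```

The last case is the worst one mentioned in the quote: a label one byte past a boundary needs 15 bytes of filler to reach the next one, even though its first fetch would already have delivered 15 useful bytes.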

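Anecdotally, I've benchmarked code that I was in the middle of optimizing several times, and found little if any benefit from aligning a loop branch target. For example, I was writing an optimized strlen function (the Gnu libraries have one, but Microsoft's don't), and tried aligning the target of the main inner loop on 8-byte, 16-byte, and 32-byte boundaries. None of these made much of a difference, especially when compared to the other drastic performance gains I was making by rewriting the code.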

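And beware that if you're not optimizing for a specific processor, you can drive yourself crazy trying to find the best "generic" code. When it comes to the effect of alignment on speed, things can vary wildly. A poor alignment strategy is often worse than no alignment strategy at all.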

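A power-of-two boundary is always a good idea, but this is readily achieved without any extra effort. Again, don't dismiss alignment out of hand, because it can matter, but by the same token, don't obsess about trying to align every branch target.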

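Alignment used to be a bigger deal on the Core 2 (Merom and Penryn) and Nehalem microarchitectures, where substantial decode bottlenecks meant that, despite a 4-wide issue width, you had a hard time keeping the execution units busy. With the introduction of the µop cache in Sandy Bridge (one of the few nice features of the Pentium 4 that was ultimately reintroduced into the extended P6 family), front-end throughput increased significantly, and this became much less of a problem.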

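Frankly, compilers aren't very good at making these types of optimizations, either. GCC's -O2 switch implies the -falign-functions, -falign-jumps, -falign-loops, and -falign-labels switches, with a default preference to align on 8-byte boundaries. This is a pretty blunt approach, and mileage varies. As I linked above, reports vary about whether disabling this alignment and going for compact code might actually increase performance. Moreover, about the best you're going to see a compiler do is insert multi-byte NOPs. I haven't seen one that uses longer forms of instructions or drastically rearranges code for the purposes of alignment. So we still have a long way to go, and it's a very difficult problem to solve. Some people are working on it, but that just goes to show how intractable the problem really is: "Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness." (Note that, while interesting, that paper hails from the early Core 2 days, which suffered more than most from misalignment penalties, as I mentioned earlier. I'm not sure you'd see the same drastic improvements on today's microarchitectures, but I can't say for sure either way, because I haven't run the test. Maybe Google will hire me and I can publish another paper?)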

"How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs?"

Questions like this can be answered by looking at Agner Fog's instruction tables and searching for NOP. I won't bother extracting all of his data into this answer.

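In general, though, just know that NOPs are not free. Although they don't require an execution unit/port, they still have to run through the pipeline like any other instruction, so they are ultimately bottlenecked by the issue (and/or retirement) width of the processor. This generally means you can execute somewhere between 3 and 5 NOPs per clock.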

NOPs also take up space in the µop cache, which means reduced code density and cache efficiency.

In many ways, you can think of a NOP as being equivalent to an XOR reg, reg or a MOV that gets eliminated in the front-end thanks to register renaming.

Answer 2

See also Cody's answer, since he has covered a lot of good stuff that I'm leaving out.

Never use multiple 1-byte NOPs. All assemblers have ways to get long NOPs; see below.

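15 NOPs take 3.75c to issue at the usual 4 per clock, but might not slow down your code at all if it was bottlenecked on a long dependency chain at that point. They do take up space in the ROB all the way until retirement. The only thing they don't do is use an execution port. The point is, CPU performance isn't additive. You can't just say "this takes 5 cycles and this takes 3, so together they will take 8". The whole point of out-of-order execution is to overlap with the surrounding code.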
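The 3.75c figure is just the instruction count divided by the issue width. A minimal Python sketch of that arithmetic follows; `issue_cycles` is a made-up helper, and the result is only the front-end issue cost, which out-of-order execution may hide entirely.

```python
def issue_cycles(n_instructions, issue_width=4):
    """Cycles needed just to issue n instructions at a fixed width per clock."""
    return n_instructions / issue_width


print(issue_cycles(15))   # 3.75  (fifteen 1-byte NOPs)
print(issue_cycles(2))    # 0.5   (the same 15 bytes as two long NOPs)
```

Seen this way, the cost depends on how many NOP instructions are executed, not on how many bytes they occupy.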

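The worse effect of many 1-byte short NOPs on SnB-family is that they tend to overflow the uop cache limit of 3 lines per aligned 32B chunk of x86 code. This means the whole 32B block always has to run from the decoders, not from the uop cache or the loop buffer. (The loop buffer only works for loops that have all their uops in the uop cache.)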
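A back-of-the-envelope check of that limit, as a Python sketch. The 3-lines-per-32B figure comes from the paragraph above; the capacity of 6 uops per line is an assumption taken from published descriptions of Sandy Bridge-family cores, not from this answer, and the model ignores every other uop cache placement rule. The uop counts in the example are invented.

```python
import math

UOPS_PER_LINE = 6        # assumption: not stated in the answer
MAX_LINES_PER_32B = 3    # from the answer: 3 lines per aligned 32B chunk


def uop_cache_lines(uops):
    """Lines needed to hold the uops of one aligned 32-byte chunk."""
    return math.ceil(uops / UOPS_PER_LINE)


def fits_in_uop_cache(uops):
    """True if the chunk's uops fit in the lines available to it."""
    return uop_cache_lines(uops) <= MAX_LINES_PER_32B


# A chunk holding 15 one-byte NOPs plus 5 other single-uop instructions:
print(fits_in_uop_cache(15 + 5))   # False (20 uops need 4 lines)
# The same padding done with 2 long NOPs:
print(fits_in_uop_cache(2 + 5))    # True (7 uops need 2 lines)
```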

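You should only ever have at most 2 NOPs in a row that actually execute, and then only if you need to pad by more than 10B or 15B or something. (Some CPUs do very badly when decoding instructions with very many prefixes, so for NOPs that actually execute it's probably best not to repeat prefixes all the way out to 15B, the maximum x86 instruction length.)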

YASM makes long NOPs by default. For NASM, use the smartalign standard macro package, which isn't enabled by default. It forces you to pick a NOP strategy.

%use smartalign
ALIGNMODE p6, 32     ;  p6 NOP strategy, and jump over the NOPs only if they're 32B or larger.

IDK if 32 is optimal. Also, beware that the longest NOPs might use a lot of prefixes and decode slowly on Silvermont or on AMD. Check the NASM manual for other modes.

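The GNU assembler's .p2align directive gives you some conditional behaviour: .p2align 4,,10 will align to 16 (1<<4), but only if that skips 10 bytes or fewer. (The empty second argument means the filler is NOPs, and the power-of-2 align name exists because plain .align is a power of 2 on some platforms but a byte count on others.) gcc often emits this before the top of loops: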

  .p2align 4,,10 
  .p2align 3
.L7:

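So you always get 8-byte alignment (the unconditional .p2align 3), but maybe also 16-byte alignment unless that would waste more than 10B. Putting the larger alignment first is important, to avoid getting e.g. a 1-byte NOP followed by an 8-byte NOP instead of a single 9-byte NOP.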
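The effect of that directive pair on a label address can be modelled in a few lines of Python. This is only a sketch of the address arithmetic (it does not model which NOP bytes are emitted); `p2align` and `gcc_loop_align` are illustrative names and the addresses are invented.

```python
def p2align(addr, power, max_skip=None):
    """Model `.p2align power,,max_skip`: align addr to 1 << power,
    unless that would skip more than max_skip bytes."""
    pad = (-addr) % (1 << power)
    if max_skip is not None and pad > max_skip:
        return addr               # alignment is skipped entirely
    return addr + pad


def gcc_loop_align(addr):
    """Model the pair `.p2align 4,,10` followed by `.p2align 3`."""
    addr = p2align(addr, 4, 10)
    return p2align(addr, 3)


print(hex(gcc_loop_align(0x1006)))   # 0x1010: 10 bytes of padding is allowed
print(hex(gcc_loop_align(0x1003)))   # 0x1008: 13 would be too many, so only 8-byte alignment
```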

It's probably possible to implement this functionality with a NASM macro.

Missing features that no assembler has (AFAIK):

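A directive to pad preceding instructions by using longer encodings (e.g. imm32 instead of imm8, or unneeded REX prefixes) to achieve the desired alignment without NOPs.
Smart conditional behaviour based on the length of the following instructions, like not padding if 4 instructions can be decoded before hitting the next 16B or 32B boundary.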
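To show what the second wished-for feature might look like, here is a purely hypothetical Python sketch of the decision rule. No assembler implements this as far as the answer knows; `should_pad`, its parameters, and the instruction lengths in the example are all invented for illustration.

```python
def should_pad(addr, next_lengths, boundary=16, min_insns=4):
    """Hypothetical rule: pad only if fewer than `min_insns` of the
    following instructions fit before the next `boundary`-byte boundary."""
    if addr % boundary == 0:
        return False                       # already aligned, nothing to do
    room = boundary - (addr % boundary)    # bytes left in the current block
    needed = sum(next_lengths[:min_insns])
    return needed > room


# Four instructions of 2, 3, 1 and 2 bytes follow the label.
print(should_pad(0x1008, [2, 3, 1, 2]))   # False: 8 bytes fit in the 8 remaining
print(should_pad(0x100C, [2, 3, 1, 2]))   # True: only 4 bytes remain
```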

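It's a good thing that alignment for decode bottlenecks isn't usually very important anymore, because tweaking it usually involves manual assemble/disassemble/edit cycles, and has to be looked at again if the preceding code changes.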

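Especially if you have the luxury of tuning for a limited set of CPUs, test, and don't pad if you don't find a perf benefit. In a lot of cases, especially for CPUs with a uop cache and/or loop buffer, it's ok not to align branch targets within functions, even loops.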

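Some of the performance variation due to varying alignment comes from making different branches alias each other in the branch-prediction caches. This secondary, subtle effect is still present even when the uop cache works perfectly and there are no front-end bottlenecks from fetching mostly-empty lines from the uop cache.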

See also Performance optimisations of x86-64 assembly - Alignment and branch prediction

Answer 3

Skylake can generally execute four single-byte nops in one cycle. This has been true at least as far back as the Sandy Bridge (hereafter SnB) micro-architecture.

Skylake, and the others back to SnB, can also generally execute four longer-than-one-byte nops in one cycle, unless they are so long that they run into front-end limitations.

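The existing answers are much more complete and explain why you might not want to use such single-byte nop instructions, so I'm not adding more, but I think it's nice to have one answer that simply answers the headline question clearly.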

optimization

x86

assembly

alignment

nop