Does it cost significant resources for a modern CPU to keep flags updated?

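As I understand it, on a modern out-of-order CPU, one of the most expensive things is state, because that state has to be tracked in multiple versions, kept up to date across many instructions, and so on.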

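Some instruction sets, such as x86 and ARM, make extensive use of flags. These were introduced when the cost model was not what it is today and flags only cost a few logic gates. For example, every arithmetic instruction sets flags to detect zero, carry and overflow.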

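Are these particularly expensive to keep updated on a modern out-of-order implementation? For example, an ADD instruction updates the carry flag, and this must be tracked because, although it will *probably* never be used, it is *possible* that some other instruction could use it N instructions later, with no fixed upper bound on N.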

Are integer operations like addition and subtraction cheaper on instruction set architectures such as MIPS that do not have these flags?

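Various aspects of this are not very well known publicly, so I'll try to separate what is definitely known from reasonable guesses and conjecture.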

One approach is to extend the (physical) integer registers (whether they take the form of a physical register file [e.g. P4 and SandyBridge+] or of results-in-ROB [e.g. P3]) with the flags produced by the operation that also produced the associated integer result. This is only about the arithmetic flags (sometimes called AFLAGS, not to be confused with EFLAGS), but I don't think the "weird flags" are the focus of this question. Interestingly, there is a patent[1] that hints at storing more than just the 6 AFLAGS themselves, putting some "combination flags" in there as well, but who knows whether that was really done: most sources say the registers are extended by 6 bits, but AFAIK we (the public) don't really know. Lumping the integer result and the associated flags together is described in, for example, this patent[2], which is primarily about preventing a certain situation where the flags might accidentally no longer be backed by any physical register. Aside from such quirks, during normal operation it has the nice effect of needing only 1 register to be allocated for an arithmetic operation, rather than a separate main result and flags result, so renaming is normally not made much worse by the existence of the flags.

Additionally, either the register alias table (RAT) needs at least one more slot to keep track of which integer register contains the latest flags, or a separate flag-renaming-state buffer keeps track of the latest speculative flag state ([2] suggests Intel chose to separate them, which may simplify the main RAT, but they don't go into such details). More slots may be used[3] to efficiently implement instructions that update only a subset of the flags (NetBurst™ famously lacked this, resulting in the now-stale advice to favour add over inc). Similarly, the non-speculative architectural state (whether that is part of the retirement register file or separate-but-similar is again not clear) needs at least one such slot.
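To make the "one allocation per arithmetic operation" point concrete, here is a deliberately simplified toy model of renaming with flags lumped into the integer result register. Everything in it (the `ToyRenamer` class, its method names, the `"FLAGS"` slot) is hypothetical and for illustration only; it is not a description of how any real Intel or AMD renamer works, and it ignores freeing registers at retirement, partial flag updates, and recovery from misspeculation.

```python
class ToyRenamer:
    """Toy register alias table: each physical register holds value + flags."""

    def __init__(self, arch_regs, num_phys):
        self.free = list(range(num_phys))
        self.rat = {}
        for reg in arch_regs:
            self.rat[reg] = self.free.pop(0)
        # The one extra slot: which physical register carries the latest flags.
        # None until the first flag-writing operation has been renamed.
        self.rat["FLAGS"] = None

    def rename_alu(self, dst, srcs, writes_flags=True, reads_flags=False):
        """Rename one ALU op; returns (new physical dst, physical sources)."""
        src_phys = [self.rat[s] for s in srcs]
        if reads_flags:
            # An ADC-like op has an extra source: whoever produced the flags.
            src_phys.append(self.rat["FLAGS"])
        new = self.free.pop(0)  # exactly one allocation per operation
        self.rat[dst] = new
        if writes_flags:
            # The flags live in the same physical register as the result.
            self.rat["FLAGS"] = new
        return new, src_phys


renamer = ToyRenamer(["eax", "ebx"], num_phys=8)
add_dst, add_srcs = renamer.rename_alu("eax", ["eax", "ebx"])  # add eax, ebx
mov_dst, mov_srcs = renamer.rename_alu("ebx", ["ebx", "eax"], writes_flags=False)
adc_dst, adc_srcs = renamer.rename_alu("eax", ["eax", "ebx"], reads_flags=True)
```

In this sketch the ADC-like operation simply gains one more source operand (the register that last wrote the flags), and an operation that does not write flags leaves the `"FLAGS"` slot pointing at the older producer.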

A different matter is computing the flags in the first place. [1] suggests that separating flag generation from the main ALU simplifies the design. It's not clear to what degree they would be separated: the main ALU has to compute the Adjust and Sign flags anyway, and having an adder output a carry out of the top is not much to ask (less than recomputing it from nothing). The Overflow flag only takes an extra XOR gate to combine the carry into the top bit with the carry out of the top bit. The Zero flag and Parity flag are not free, though (and they depend on the result, not on the calculation of the result), so if there is partial separation it would make sense for those to be computed separately. Perhaps it really is all separate. In NetBurst™, flag calculation took an extra half-cycle (the ALU was double-pumped and staggered)[4], but whether that means all flags are computed separately or only a subset of them (or even a superset, as [1] hints) is not clear: the flags result is treated as monolithic, so latency tests cannot distinguish whether a flag was computed by the flag unit in the third half-cycle or was simply handed to the flag unit by the ALU. In any case, typical ALU operations could be executed back to back, even when dependent (meaning that the high half of the first operation and the low half of the second operation ran in parallel); the delayed computation of the flags did not stand in the way of that. As you might expect, ADC/SBB were not very efficient on NetBurst, but there may be other reasons for that too (for some reason a lot of µops were involved).
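The flag definitions discussed above can be sketched in software as follows. This is a behavioural model of a 32-bit add producing the six x86 arithmetic flags, not a description of any real ALU's gate-level structure; the function name `add32_with_flags` and the dictionary layout are made up for illustration. It shows which flags fall out of the adder's carry chain (CF, AF, OF, SF) and which are functions of the finished result (ZF, PF).

```python
MASK32 = 0xFFFFFFFF


def add32_with_flags(a, b, carry_in=0):
    """Add two 32-bit values; return (result, dict of arithmetic flags)."""
    a &= MASK32
    b &= MASK32
    full = a + b + carry_in
    result = full & MASK32

    # Flags that come more or less directly from the adder's carry chain.
    cf = (full >> 32) & 1  # carry out of the top bit
    carry_into_top = (((a & 0x7FFFFFFF) + (b & 0x7FFFFFFF) + carry_in) >> 31) & 1
    of = carry_into_top ^ cf  # the single extra XOR
    af = (((a & 0xF) + (b & 0xF) + carry_in) >> 4) & 1  # carry out of bit 3
    sf = (result >> 31) & 1  # just the top bit of the result

    # Flags that depend on the finished result rather than on the carries.
    zf = int(result == 0)  # NOR over all result bits
    pf = int(bin(result & 0xFF).count("1") % 2 == 0)  # even parity, low byte

    return result, {"CF": cf, "OF": of, "SF": sf, "AF": af, "ZF": zf, "PF": pf}


signed_overflow = add32_with_flags(0x7FFFFFFF, 1)
unsigned_wrap = add32_with_flags(0xFFFFFFFF, 1)
```

Here `signed_overflow` sets OF but not CF, while `unsigned_wrap` sets CF and ZF but not OF, which is the usual distinction between signed and unsigned overflow.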

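Overall, I would conclude that the existence of arithmetic flags costs significant engineering resources to prevent them from having a significant impact on performance, but that this effort is effective, so a significant impact is avoided.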