Is a double aligned to an 8-byte boundary because of the FPU or because of the cache?

I'd like to understand why a double is aligned to an 8-byte boundary rather than just a 4-byte boundary. In this article it says:

  1. When memory reading is efficient in reading 4 bytes at a time on 32 bit machine, why should a double type be aligned on 8 byte boundary?

It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution. All this will be done behind the scenes.

As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.

The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. I am assuming (I don’t have concrete information) in case of FPU operations, data fetch might be different, I mean the data bus, since it goes to FPU. Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins.

Whereas in this SO question it says:

The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.

The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two, 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, the current processors don't fetch 32-bits from the memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths; the actual time to fetch both words if they are in the same cache line, and already cached, may be just 1 clock).

A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of an data fetch.

So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.

So which one is the correct answer? Or is it both?

It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU).

So, first of all, the article gets a few things wrong. There is no real, separate FPU in the processor anymore; arithmetic instructions are essentially handled in the same instruction pipelines and so on.

The main processor is nothing to do with floating point execution.

It is 2015 and we are not talking about an Intel 486, so this is plain wrong.

As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.

As far as I know this is simply not true; there are instructions that operate on single-precision floats and instructions that operate on doubles.

The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary.

That is not true at all. Some instructions can only be used with specially aligned memory, and some are just faster with it, but that generally depends on their specification or on the respective implementation. Things like the number of cycles a particular operation needs change between processor generations!

So, the SO answer is right. Trust your compiler. If you want aligned memory yourself (e.g., for arrays of floats where you hope the compiler will use SIMD instructions), there are things like posix_memalign (under Unix, of course, though I could imagine the POSIX layer in Windows NT implemented it later, too) that will give you nicely aligned memory.

In general, memory alignment issues are mostly hidden by the memory unit: the execution units get handed data that is correctly rotated and correctly sized (and the same issues can apply to integer types, too).

So alignment is mainly about being able to cache this data without having to worry about fetching it in pieces (split fetches), which is a tricky business that raises all sorts of coherency and atomicity questions.

That could of course change if some architecture wanted to save on the rotation logic and force you to align some of your data accordingly, but on the whole this is a rather easy problem to solve, so constraining the architecture for the sake of the hardware would be somewhat pointless (at least for now).