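Why can pushing a double on the stack with two 32-bit pushes be a lot slower than pushing it with floating-point instructions (fldl and fstpl)?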

Question

Here is a short piece of assembly code (I am using GNU assembler syntax).

.extern cos
.section .data
pi: .double 3.14
.section .text
.global slowcos
.global fastcos

fastcos:
  fldl pi         
  subl $8, %esp   # makes some space for a double on the stack
  fstpl 0(%esp)   # copy pi on top of the stack
  call cos
  addl $8, %esp
  ret

slowcos:
  pushl pi+4      # push the last 4 bytes of pi on top of the stack
  pushl pi        # push the first 4 bytes of pi on top of the stack
  call cos
  addl $8, %esp
  ret

These symbols can easily be called from C using the following prototypes:

extern double fastcos ();
extern double slowcos ();

Both return the value of cos(3.14), but on the Intel 32-bit architecture slowcos is two times slower than fastcos. My question is:

What can explain such a large performance difference?

On Linux, you can test it by copying this code into a file called cos.asm and running:

as --32 cos.asm -o cos.o 
gcc -m32 -O0 cos.o test.c -lm -o test

(If you are not on a 64-bit system, you can, and probably should, drop --32/-m32.) Here test.c is the following C source file:

#include <stdio.h>
#include <time.h>

#define N 40000000

extern double fastcos ();
extern double slowcos ();

int main() {
  int k;
  double r; 
  clock_t t;

  t = clock();
  for (k = 0; k < N;k ++) 
    r = fastcos();
  printf ("%gs\n",(double) (clock() - t) / CLOCKS_PER_SEC);
  printf("fastcos = %g\n", r);

  t = clock();
  for (k = 0; k < N;k ++)
    r = slowcos();
  printf ("%gs\n",(double) (clock() - t) / CLOCKS_PER_SEC);
  printf("slowcos = %g\n", r);

  return 0;
}

Output on my computer:

1.55687s
fastcos = -0.999999
2.29821s
slowcos = -0.999999

One more remark. If you add the line ".global id" to the header, replace the line "call cos" with "call id" in both fastcos and slowcos, and add "double id (double x) { return x; }" to the C file, then you get:

0.360433s
fastpi = 3.14
0.370393s
slowpi = 3.14

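So the two versions spend about the same time outside the inner call to the function cos (or id). This should show that the difference arises during the execution of the cosine function itself. But I do not understand what could justify this difference. There is no difference in the alignment of %esp.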

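Finally, I would like to say that I observe these differences in real-life "numerical" code, where the bottleneck is usually the evaluation of "elementary math functions" (such as cos or exp). Moreover, both versions were generated by compilers of high-level programming languages. My main concern is to understand what is happening there.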

Answer 1

When a modern x86 writes to memory and then reads the same memory again shortly afterwards, it cheats to avoid a full round trip to memory/cache:

Intel® 64 and IA-32 Architectures Optimization Reference Manual

2.3.4.4 Store Forwarding

If a load follows a store and reloads the data that the store writes to memory, the Intel Core microarchitecture can forward the data directly from the store to the load. This process, called store to load forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory.

The text goes on about alignment requirements, but the important bit is:

The store must be equal or greater in size than the size of data being loaded.
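
The size rule quoted above can be pictured with a toy model. This is only a sketch: pending_store and can_forward are hypothetical names invented for illustration, and the model deliberately ignores the alignment and address conditions the manual also lists. It simply asks whether one single pending store covers every byte of the load.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: a store still sitting in the store buffer,
 * covering the bytes [addr, addr + size). */
typedef struct {
    uintptr_t addr;
    size_t size;
} pending_store;

/* Returns true if one single pending store is at least as large as the
 * load and covers all of its bytes. This models just the quoted size
 * rule; real hardware has further conditions that are not modelled. */
static bool can_forward(const pending_store *stores, size_t n,
                        uintptr_t load_addr, size_t load_size)
{
    for (size_t i = 0; i < n; i++) {
        const pending_store *s = &stores[i];
        if (s->size >= load_size &&
            load_addr >= s->addr &&
            load_addr + load_size <= s->addr + s->size)
            return true;
    }
    return false;
}
```

In this model, one 8-byte store followed by an 8-byte load at the same address qualifies, while two adjacent 4-byte stores followed by an 8-byte load spanning both do not, because no single store supplies all eight bytes.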

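In the slow function, you store the eight-byte double in two four-byte chunks. Presumably, the cos() function loads it as a single chunk, so the load has to wait until the stores have been committed to the cache.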

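In the fast function, on the other hand, you store a single eight-byte chunk, which stays in the CPU's internal buffers, and the load in cos() can be satisfied from there right away.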
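
The two access patterns can also be reproduced from C without calling cos at all. The following is a minimal sketch, not the code from the question: slot, split_store_then_load, whole_store_then_load and time_pattern are hypothetical names, and it assumes the compiler turns each volatile access into exactly one store or load of that width, which is not guaranteed. Whether a timing gap shows up, and how large it is, depends on the CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

static_assert(sizeof(double) == 8, "this sketch assumes an 8-byte double");

/* The same 8 bytes viewed as one double or as two 32-bit words. */
typedef union {
    double d;
    uint32_t w[2];
} slot;

/* Like slowcos: two 4-byte stores followed by one 8-byte load. */
static double split_store_then_load(volatile slot *s, const uint32_t w[2])
{
    s->w[0] = w[0];
    s->w[1] = w[1];
    return s->d;
}

/* Like fastcos: one 8-byte store followed by one 8-byte load. */
static double whole_store_then_load(volatile slot *s, double x)
{
    s->d = x;
    return s->d;
}

/* Hypothetical timing helper: runs one pattern n times, stores the last
 * loaded value in *result and returns the elapsed CPU time in seconds. */
static double time_pattern(int split, long n, double x, double *result)
{
    volatile slot s;
    uint32_t w[2];
    double r = 0.0;
    clock_t t;

    memcpy(w, &x, sizeof w);   /* raw halves of x, in memory order */
    t = clock();
    for (long k = 0; k < n; k++)
        r = split ? split_store_then_load(&s, w)
                  : whole_store_then_load(&s, x);
    *result = r;
    return (double)(clock() - t) / CLOCKS_PER_SEC;
}
```

Calling time_pattern(1, N, 3.14, &r) and time_pattern(0, N, 3.14, &r) from a main like the one in test.c, built with -m32, gives a rough comparison of the two patterns. Both variants load back the same value, so any difference is purely in how the stores were issued.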

