Why is GCC emitting larger output for a bytewise copy vs memcpy?

Question

The following C11 program extracts the bit representation of a float into a uint32_t in two different ways.

#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

uint32_t f2i_memcpy(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

The output assembly, compiled with armgcc 10.2.1 (none eabi), is very different, even with the -Os or -O3 optimizations applied:

I'm compiling with: -mcpu=cortex-m4 -std=c11 -mfpu=fpv4-sp-d16 -mfloat-abi=hard

f2i_char:
  sub sp, sp, #16
  vstr.32 s0, [sp, #4]
  ldr r3, [sp, #4]
  strb r3, [sp, #12]
  ubfx r2, r3, #8, #8
  strb r2, [sp, #13]
  ubfx r2, r3, #16, #8
  ubfx r3, r3, #24, #8
  strb r2, [sp, #14]
  strb r3, [sp, #15]
  ldr r0, [sp, #12]
  add sp, sp, #16
  bx lr
f2i_memcpy:
  sub sp, sp, #8
  vstr.32 s0, [sp, #4]
  ldr r0, [sp, #4]
  add sp, sp, #8
  bx lr

Why doesn't gcc generate the same assembly for both functions?

Godbolt example

Answer 1

Why is GCC emitting larger output with -Os than -O3 for this function on Cortex-M4?

Why not? Each option enables or disables specific compiler internals. There can be, and will be, compiler decisions that make -O3 produce smaller code than -Os.

Is there anything specific about the C11 standard or the Armv7E-M that's inhibiting gcc from emitting the smaller assembly at -Os?

No.

Is this gcc missing an optimization opportunity?

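Yes, you could say that. But it may be intentional: the optimization that would produce such code may be expensive in compile time and CPU usage, so it is disabled. It's just what it is.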

Answer 2

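Avoid copying data manually. Use memcpy. GCC knows this function very well and will not call it at all if it is not needed. Pointer punning can also break the strict aliasing rules (access through a char pointer, as in f2i_char, is the permitted exception).

With none-eabi, no code is emitted for the memcpy version, because under the soft-float calling convention used here the return value is passed in the same register as the parameter (r0). No operation is needed.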

https://godbolt.org/z/q8v39d737

#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

uint32_t f2i1(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

f2i_char:
        sub     sp, sp, #8
        ubfx    r1, r0, #8, #8
        ubfx    r2, r0, #16, #8
        ubfx    r3, r0, #24, #8
        strb    r0, [sp, #4]
        strb    r1, [sp, #5]
        strb    r2, [sp, #6]
        strb    r3, [sp, #7]
        ldr     r0, [sp, #4]
        add     sp, sp, #8
        bx      lr
f2i1:
        bx      lr

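Edit:

With -mfloat-abi=hard you force the FPU registers to be used for any float-related operation, even one that is not arithmetic. I usually use softfp, which gives hardware floating-point instructions with soft-float linkage.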

https://gcc.godbolt.org/z/z39qnvY1c

The output assembly, compiled with armgcc 10.2.1 (none eabi) is very different, even with the -Os or -O3 optimizations applied:

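You copy byte by byte, and the compiler follows your code. When you use memcpy, the compiler understands your intent and does not copy byte by byte. The additional floating-point instruction is needed because you use the hard-float ABI: the argument arrives in s0 and the result must be returned in r0, and GCC performs that transfer through memory here. With the soft-float ABI, both the float argument and the integer result are passed in r0.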
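Both functions copy the same four bytes in the same order, so they return the same value; only the generated code differs. A minimal host-side sketch to check this, where f2i_agree is a hypothetical helper that is not part of the original post:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Bytewise copy, as in the question. */
static uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

/* memcpy version, as in the question. */
static uint32_t f2i_memcpy(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

/* Hypothetical helper: returns 1 when both methods yield the same bits. */
int f2i_agree(float f) {
  return f2i_char(f) == f2i_memcpy(f);
}
```

Calling f2i_agree on a handful of values (zero, negative zero, a denormal, infinity) on the host should return 1 each time, which suggests the size difference is a code-generation matter rather than a semantic one.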

Answer 3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

GCC fails to recognize and match the unrolled version as a bswap or store-merging pattern.

GCC does recognize the loop version.
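For reference, a loop version of the bytewise copy might look like the sketch below. The name f2i_loop is hypothetical, and whether a given GCC release merges the byte stores for this form should be checked on the target compiler rather than assumed.

```c
#include <stddef.h>
#include <stdint.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical name: the same bytewise copy as f2i_char, written as a loop
   instead of four unrolled statements. */
uint32_t f2i_loop(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  for (size_t i = 0; i < sizeof(x); i++) {
    dst[i] = src[i];
  }
  return x;
}
```

The result is the same as for f2i_char and f2i_memcpy; only the shape of the source that the optimizer has to pattern-match changes.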

Tags: c, gcc, arm, cortex-m