Convert uint64_t to byte array portably and optimally in Clang

Question

Say you want to convert a uint64_t to a uint8_t[8] (little endian). On a little-endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:

void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}

This generates efficient assembly:

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

But it is not portable. It will behave differently on a big-endian machine.
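To make the portability problem concrete, here is a minimal sketch. The helper name host_is_little_endian is invented for this illustration; it probes the host byte order at run time, and the byte that from_memcpy puts first depends on its answer.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): true when the host
// stores the least significant byte of an integer at the lowest address.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    std::uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Same conversion as above: copies x in whatever byte order the host uses.
inline void from_memcpy(const std::uint64_t& x, std::uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}
```

On a little-endian host bytes[0] receives the least significant byte of x; on a big-endian host it receives the most significant one, so the function alone does not produce a fixed wire format.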

To convert a uint8_t[8] to a uint64_t there is a great solution - just do this:

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}

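This looks inefficient, but with Clang -O2 it actually generates exactly the same assembly as before, and if you compile for a big-endian machine it is smart enough to use the native byte-swap instruction. E.g. this code: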

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[7]) << 8*0) |
        (std::uint64_t(bytes[6]) << 8*1) |
        (std::uint64_t(bytes[5]) << 8*2) |
        (std::uint64_t(bytes[4]) << 8*3) |
        (std::uint64_t(bytes[3]) << 8*4) |
        (std::uint64_t(bytes[2]) << 8*5) |
        (std::uint64_t(bytes[1]) << 8*6) |
        (std::uint64_t(bytes[0]) << 8*7);
}

compiles to:

mov     rax, qword ptr [rdi]
bswap   rax
mov     qword ptr [rsi], rax
ret

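My question is: is there an equivalent, reliably optimised construct for converting in the opposite direction? I tried this, but it gets compiled naively: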

void from(const std::uint64_t &x, uint8_t* bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}

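Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later, as long as you use uint8_t* __restrict__ bytes. However, I still haven't managed to find a form that Clang will optimise.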
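For reference, the __restrict__ variant mentioned in the edit might look like the sketch below. The name from_restrict is made up for this illustration, and __restrict__ is a GCC/Clang extension rather than standard C++.

```cpp
#include <cstdint>

// __restrict__ promises the compiler that bytes does not overlap any other
// object accessed in this function, including the referent of x.
// Calling it with overlapping arguments would break that promise.
void from_restrict(const std::uint64_t& x, std::uint8_t* __restrict__ bytes) {
    bytes[0] = std::uint8_t(x >> 8*0);
    bytes[1] = std::uint8_t(x >> 8*1);
    bytes[2] = std::uint8_t(x >> 8*2);
    bytes[3] = std::uint8_t(x >> 8*3);
    bytes[4] = std::uint8_t(x >> 8*4);
    bytes[5] = std::uint8_t(x >> 8*5);
    bytes[6] = std::uint8_t(x >> 8*6);
    bytes[7] = std::uint8_t(x >> 8*7);
}
```

The result is the same as for the plain version; the qualifier only changes what the optimiser is allowed to assume.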

Answer 1

Here is what I could test based on the discussion in OP's comments:

void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
    std::uint64_t big;
    std::uint8_t* temp = (std::uint8_t*)&big;
    temp[0] = x >> 8*0;
    temp[1] = x >> 8*1;
    temp[2] = x >> 8*2;
    temp[3] = x >> 8*3;
    temp[4] = x >> 8*4;
    temp[5] = x >> 8*5;
    temp[6] = x >> 8*6;
    temp[7] = x >> 8*7;
    std::uint64_t* dest = (std::uint64_t*)bytes;
    *dest = big;
}

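It looks like this makes the intent clearer to the compiler and lets it make the assumptions it needs to optimise the code (on both GCC and Clang with -O2).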

Compiling for x86-64 (little endian) with Clang 8.0.0 (test on Godbolt):

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

Compiling for aarch64_be (big endian) with Clang 8.0.0 (test on Godbolt):

ldr     x8, [x0]
rev     x8, x8
str     x8, [x1]
ret

Answer 2

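First of all, the reason your original from implementation cannot be optimised is that you pass the arguments by reference and by pointer. The compiler therefore has to consider the possibility that both point to the very same address (or at least that they overlap). As you have 8 consecutive read and write operations on the (potentially) same address, the as-if rule cannot be applied here.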

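Note that just by removing the & from the function signature, apparently GCC already considers this as proof that bytes does not point into x and thus this can safely be optimized. However, for Clang this is not good enough. Technically, of course, bytes can point into from's stack memory (aka x), but I think that would be undefined behaviour, so Clang just misses this optimisation.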

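Your implementation of to doesn't suffer from this issue, because you have implemented it in such a way that you first read all the values of bytes and then make one big assignment to x. So even if x and bytes point to the same address, this can be optimised, because you do all the reading first and all the writing afterwards (instead of mixing reads and writes as you do in from).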
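The difference between interleaving reads and writes and reading first can be sketched as follows. Both function names are invented for this illustration, and big-endian output order is used only because it makes the overlap visible on common little-endian hosts.

```cpp
#include <cstdint>

// Reads x again before every store. If bytes overlaps x, an earlier store
// may change what a later read of x sees, so the compiler has to keep the
// eight loads and stores in order.
void store_be_interleaved(const std::uint64_t& x, std::uint8_t* bytes) {
    for (int i = 0; i < 8; ++i)
        bytes[i] = std::uint8_t(x >> (8 * (7 - i)));
}

// Copies x into a local first. Later stores through bytes cannot modify the
// local copy, so the result is the same whether or not bytes overlaps x.
void store_be_read_first(const std::uint64_t& x, std::uint8_t* bytes) {
    const std::uint64_t copy = x;
    for (int i = 0; i < 8; ++i)
        bytes[i] = std::uint8_t(copy >> (8 * (7 - i)));
}
```

With non-overlapping arguments the two functions agree. When bytes points at x itself, only the read-first version is guaranteed to store the big-endian bytes of the original value; what the interleaved version stores in that case depends on the host byte order.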

The approach in Answer 1 works because it does precisely this: it reads all the values first and only then writes to the destination.

However, there is a less complicated way to achieve this:

void from(uint64_t x, uint8_t* dest) {
    uint8_t bytes[8];
    bytes[7] = uint8_t(x >> 8*7);
    bytes[6] = uint8_t(x >> 8*6);
    bytes[5] = uint8_t(x >> 8*5);
    bytes[4] = uint8_t(x >> 8*4);
    bytes[3] = uint8_t(x >> 8*3);
    bytes[2] = uint8_t(x >> 8*2);
    bytes[1] = uint8_t(x >> 8*1);
    bytes[0] = uint8_t(x >> 8*0);

    *(uint64_t*)dest = *(uint64_t*)bytes;
}

compiles to

mov     qword ptr [rsi], rdi
ret

on little endian and

rev     x8, x0
str     x8, [x1]
ret

on big endian.

Note that even if you pass x by reference, Clang is able to optimise this. However, that costs one more instruction in each case:

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

and

ldr     x8, [x0]
rev     x8, x8
str     x8, [x1]
ret

respectively.
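The final line of from above writes through a uint64_t* obtained by casting dest, which assumes that dest is suitably aligned and that the type-punning cast is tolerated by the compiler. A hedged alternative, not part of the original answer, is to finish with std::memcpy instead. Compilers commonly lower a fixed 8-byte memcpy to a single store, but the generated code was not re-checked for this sketch. The name from_memcpy_le is made up here.

```cpp
#include <cstdint>
#include <cstring>

// Same idea as from() above: take x by value, build all eight bytes in a
// local buffer, then write the destination in one step.
void from_memcpy_le(std::uint64_t x, std::uint8_t* dest) {
    std::uint8_t bytes[8];
    for (int i = 0; i < 8; ++i)
        bytes[i] = std::uint8_t(x >> (8 * i));  // byte i = bits 8i..8i+7
    // memcpy has no alignment requirement on dest and involves no cast.
    std::memcpy(dest, bytes, sizeof(bytes));
}
```

Because memcpy works on unaligned destinations, this version can also write into the middle of a larger byte buffer.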

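Also note that you can improve your implementation of to with a similar trick: instead of passing the result by non-const reference, take the "more natural" approach and just return it from the function: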

uint64_t to(const uint8_t* bytes) {
    return
        (uint64_t(bytes[7]) << 8*7) |
        (uint64_t(bytes[6]) << 8*6) |
        (uint64_t(bytes[5]) << 8*5) |
        (uint64_t(bytes[4]) << 8*4) |
        (uint64_t(bytes[3]) << 8*3) |
        (uint64_t(bytes[2]) << 8*2) |
        (uint64_t(bytes[1]) << 8*1) |
        (uint64_t(bytes[0]) << 8*0);
}

Summary:

Don't pass arguments by reference.
Do all the reading first, then all the writing.

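These are the best solutions I could find for both little endian and big endian. Note that to and from are truly inverse operations, which can be optimised to a no-op if executed one after another.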
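The inverse relationship can be checked with a small round trip. This is only a sketch: load_le, store_le and round_trip are names invented here that mirror to and from, and nothing is claimed about the code a particular compiler emits for them.

```cpp
#include <cstdint>
#include <cstring>

// Mirrors to(): assemble a value from 8 little-endian bytes and return it.
std::uint64_t load_le(const std::uint8_t* bytes) {
    std::uint64_t x = 0;
    for (int i = 0; i < 8; ++i)
        x |= std::uint64_t(bytes[i]) << (8 * i);
    return x;
}

// Mirrors from(): x by value, all bytes computed first, then one copy out.
void store_le(std::uint64_t x, std::uint8_t* dest) {
    std::uint8_t bytes[8];
    for (int i = 0; i < 8; ++i)
        bytes[i] = std::uint8_t(x >> (8 * i));
    std::memcpy(dest, bytes, sizeof(bytes));
}

// Hypothetical helper: storing and loading again should return x unchanged.
std::uint64_t round_trip(std::uint64_t x) {
    std::uint8_t buf[8];
    store_le(x, buf);
    return load_le(buf);
}
```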

Answer 3

The code you've given is way overcomplicated. You can replace it with:

#include <endian.h>   // htole64() (glibc)
#include <cstdint>
#include <cstring>

void from(uint64_t x, uint8_t* dest) {
    x = htole64(x);
    std::memcpy(dest, &x, sizeof(x));
}

Yes, this uses the Linux-ism htole64(), but if you are on another platform you can easily reimplement it.

Clang and GCC optimise this perfectly, on both little- and big-endian platforms.
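A reimplementation for platforms without htole64() could look like the sketch below. It assumes a GCC- or Clang-compatible compiler, since it relies on the predefined __BYTE_ORDER__ macros and the __builtin_bswap64 builtin. The names my_htole64 and from_htole are made up for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for htole64(): converts from host byte order to
// little-endian byte order.
inline std::uint64_t my_htole64(std::uint64_t x) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return x;                     // host is already little endian
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap64(x);  // GCC/Clang builtin: reverse the 8 bytes
#else
#error "unknown byte order: this sketch only handles little and big endian"
#endif
}

// Same shape as from() above, using the stand-in instead of htole64().
void from_htole(std::uint64_t x, std::uint8_t* dest) {
    x = my_htole64(x);
    std::memcpy(dest, &x, sizeof(x));
}
```

Unlike the shift-based versions, this approach selects the byte order at compile time with the preprocessor, so it needs one branch per supported byte order.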

Answer 4

What about returning a value? Easy to reason about, and the assembly is small:

#include <cstdint>
#include <array>

auto to_bytes(std::uint64_t x)
{
    std::array<std::uint8_t, 8> b;
    b[0] = x >> 8*0;
    b[1] = x >> 8*1;
    b[2] = x >> 8*2;
    b[3] = x >> 8*3;
    b[4] = x >> 8*4;
    b[5] = x >> 8*5;
    b[6] = x >> 8*6;
    b[7] = x >> 8*7;
    return b;
}

https://godbolt.org/z/FCroX5

And for big endian:

#include <stdint.h>

struct mybytearray
{
    uint8_t bytes[8];
};

auto to_bytes(uint64_t x)
{
    mybytearray b;
    b.bytes[0] = x >> 8*0;
    b.bytes[1] = x >> 8*1;
    b.bytes[2] = x >> 8*2;
    b.bytes[3] = x >> 8*3;
    b.bytes[4] = x >> 8*4;
    b.bytes[5] = x >> 8*5;
    b.bytes[6] = x >> 8*6;
    b.bytes[7] = x >> 8*7;
    return b;
}

https://godbolt.org/z/WARCqN

(std::array is not available with -target aarch64_be?)
