Using packed bitfields for non-power-of-two byte integral members of a structure

Question

I have some binary data that contains structures defined as follows:

s1:
a - 1B
b - 4B
c - 2B
d - 8B

s2:
a - 1B
b - 3B
c - 2B
d - 6B

It uses big-endian byte order.

I parse s1 with the following code:

#include <endian.h>
#include <stdint.h>
#include <string.h>

struct [[gnu::packed]] s1 {
    uint8_t  a;
    uint32_t b;
    uint16_t c;
    uint64_t d;
};

void parse_s1(struct s1 *parsed, const unsigned char buf[static restrict sizeof(*parsed)])
{
    memcpy(parsed, buf, sizeof(*parsed));
    parsed->b = be32toh(parsed->b);
    parsed->c = be16toh(parsed->c);
    parsed->d = be64toh(parsed->d);
}

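which, AFAIK, is defined behavior.

For the second structure, I'm considering the following code, but I'm not sure whether I would run into undefined behavior at some point. I'm lucky that the fields are byte-aligned (i.e., there are no 3-bit fields) and that the non-power-of-two byte fields are not contiguous, but a complete answer might also consider what happens when two non-power-of-two byte integers are contiguous (i.e., if d came before c):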

#include <endian.h>
#include <stdint.h>
#include <string.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

void parse_s2(struct s2 *parsed, const unsigned char buf[static restrict sizeof(*parsed)])
{
    memcpy(parsed, buf, sizeof(*parsed));
    parsed->b = be32toh(parsed->b) >> 8;
    parsed->c = be16toh(parsed->c);
    parsed->d = be64toh(parsed->d) >> 16;
}

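I also assume that this code is C++ compatible (or, more specifically, gnu++ compatible), except for the actual prototype of the function, which in C++ would use __restrict__ and a pointer instead of a VLA.

Is the code above correct in both C and C++ (GNU dialects)? Or does it rely on Undefined Behavior?

Edit:

The following is a reply to @Anaconda's comment, since it does not format properly in a comment: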

//bitfields.c

#include <stdint.h>
#include <stdio.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

int main(void)
{
    printf("%zu\n", sizeof(struct s2));
    return 0;
}

$ cc -Wall -Wextra -pedantic -std=c2x bitfields.c

$ ./a.out
12

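GCC seems (experimentally) to understand [[gnu::packed]] in such a way that it packs the bit-fields as tightly as possible (at least at byte granularity; I'm not dealing with integer widths that aren't multiples of 8, so that's fine with me).

I'm not sure whether there is any UB here, but I think I would have to be very unlucky, and GCC would have to be truly evil, for the code above to break.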
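The layout assumption can be made explicit with a compile-time check, so that a compiler or ABI that packs differently fails the build instead of silently misparsing. This is only a sketch, assuming GCC or Clang on x86_64 (where the size of 12 was measured above). It uses the older __attribute__((packed)) spelling so that it also compiles before C23.

```c
#include <assert.h>   /* static_assert (C11 and later) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct __attribute__((packed)) s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

/* Wire format: a(1) + b(3) + c(2) + d(6) = 12 bytes, no padding. */
static_assert(sizeof(struct s2) == 12, "s2 is not packed to 12 bytes");
static_assert(offsetof(struct s2, a) == 0, "unexpected offset of a");
static_assert(offsetof(struct s2, c) == 4, "unexpected offset of c");
```

offsetof cannot be applied to the bit-fields b and d, so their positions are only implied by the total size and the offsets of their neighbours. The check also says nothing about how bits are ordered inside each bit-field.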

Answer 1

Is the code above correct in both C and C++ (GNU dialects)? Or does it rely on Undefined Behavior?

Well, according to the standard, it's UB.

According to GNU, all of these details are "Determined by ABI".

According to Agner Fog's summary of x86 calling conventions [pdf],

Objects of structures and classes are stored by placing the data members consecutively in memory. Unused bytes may be inserted between elements and after the last element, if needed, for the sake of alignment. The requirements for alignment are ...

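There is no language that permits reordering, and no special language about bit-fields. So this should broadly cover GNU on x86. If you want to use GNU on a different platform, either say which one it is or look at its ABI documentation.

Note: other language features, pointers and C++ references in particular, may not work properly, or at all, with misaligned and oddly sized data members.

Sometimes it is more efficient to deserialize the slow way into a normally aligned structure, operate on properly aligned, normally sized types, and re-serialize the result if necessary.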
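To illustrate the pointer caveat: the address of a member of a packed structure may be misaligned for its type, and the address of a bit-field cannot be taken at all. A minimal sketch of the usual workaround, which is to copy the member into an ordinary local first (twice and twice_b are hypothetical names used only for this example):

```c
#include <stdint.h>

struct __attribute__((packed)) s1 {
    uint8_t  a;
    uint32_t b;   /* sits at offset 1, so it is not 4-byte aligned */
    uint16_t c;
    uint64_t d;
};

/* Hypothetical consumer that expects a properly aligned pointer. */
static uint32_t twice(const uint32_t *p)
{
    return *p * 2u;
}

uint32_t twice_b(const struct s1 *s)
{
    /* Avoid twice(&s->b): that pointer may be misaligned. */
    uint32_t tmp = s->b;   /* by-value access through the packed type */
    return twice(&tmp);
}
```

Reading s->b by value is fine because the compiler knows the member is packed and emits a suitable access. That knowledge is lost once the address is stored in a plain uint32_t pointer.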
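A minimal sketch of that "slow way" for s2, assuming the 1 + 3 + 2 + 6 byte big-endian wire layout from the question. The names s2_native, load_be and parse_s2_portable are made up for this example. It uses neither packed structures, bit-fields, nor <endian.h>, so it does not depend on the ABI or on host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ordinary, naturally aligned in-memory form of s2. */
struct s2_native {
    uint8_t  a;
    uint32_t b;   /* holds the 24-bit wire value */
    uint16_t c;
    uint64_t d;   /* holds the 48-bit wire value */
};

/* Read n (at most 8) big-endian bytes into an unsigned integer. */
static uint64_t load_be(const unsigned char *p, size_t n)
{
    assert(n <= sizeof(uint64_t));
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/* buf must point to at least 12 bytes: a(1) b(3) c(2) d(6). */
void parse_s2_portable(struct s2_native *out, const unsigned char *buf)
{
    out->a = buf[0];
    out->b = (uint32_t)load_be(buf + 1, 3);
    out->c = (uint16_t)load_be(buf + 4, 2);
    out->d = load_be(buf + 6, 6);
}
```

Because each field is assembled from individual bytes, the same function also handles layouts where two odd-sized fields are adjacent. Only the offsets and lengths change.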

Answer 2

An experimental answer for x86_64 (to test other architectures, just compile the same code on them):

// bitfields.c:

#define _DEFAULT_SOURCE
#include <endian.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

uint32_t be24toh(uint32_t be)
{
#if BYTE_ORDER == BIG_ENDIAN
    return be;
#elif BYTE_ORDER == LITTLE_ENDIAN
    return be32toh(be) >> 8;
#else
    #error "wtf"
#endif
}

uint64_t be48toh(uint64_t be)
{
#if BYTE_ORDER == BIG_ENDIAN
    return be;
#elif BYTE_ORDER == LITTLE_ENDIAN
    return be64toh(be) >> 16;
#else
    #error "wtf"
#endif
}

void parse_s2(struct s2 *s, const unsigned char *raw)
{
    memcpy(s, raw, sizeof(*s));
    s->b = be24toh(s->b);
    s->c = be16toh(s->c);
    s->d = be48toh(s->d);
}

int main(void)
{
    struct s2 s;
    unsigned char raw[sizeof(s)] = {1,2,3,4,5,6,7,8,9,0xA,0xB,0xC};

    parse_s2(&s, raw);

    puts("sizeof:");
    printf("s:   %zu\n", sizeof(s));
    printf("s.a: %zu\n", sizeof(s.a));
    printf("s.c: %zu\n", sizeof(s.c));

    puts("offsetof:");
    printf("s.a: %zu\n", offsetof(struct s2, a));
    printf("s.c: %zu\n", offsetof(struct s2, c));

    puts("contents:");
    printf("s.a: %#.2jx\n", (uintmax_t) s.a);
    printf("s.b: %#.6jx\n", (uintmax_t) s.b);
    printf("s.c: %#.4jx\n", (uintmax_t) s.c);
    printf("s.d: %#.12jx\n", (uintmax_t) s.d);
    return 0;
}

Results:

$ cc -Wall -Wextra -pedantic -std=c2x bitfields.c 
$ ./a.out 
sizeof:
s:   12
s.a: 1
s.c: 2
offsetof:
s.a: 0
s.c: 4
contents:
s.a: 0x01
s.b: 0x020304
s.c: 0x0506
s.d: 0x0708090a0b0c

$ c++ -Wall -Wextra -pedantic -std=c++20 bitfields.c 
$ ./a.out 
sizeof:
s:   12
s.a: 1
s.c: 2
offsetof:
s.a: 0
s.c: 4
contents:
s.a: 0x01
s.b: 0x020304
s.c: 0x0506
s.d: 0x0708090a0b0c

It works as expected on GCC 11 (x86_64) and does not trigger any warnings.
