Portability of using union for conversion

I want to represent a 32-bit number using RGBA values. Is it portable to generate the value of that number using a union? Consider this C code:

union pixel {
    uint32_t value;
    uint8_t RGBA[4];
};

This compiles fine, and I like using it instead of a bunch of functions. But is it safe?

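Using unions for "type punning" is fine in C, and fine in gcc's C++ as well (as a gcc [g++] extension). However, "type punning" via unions has hardware architecture endianness considerations.

This is called "type punning", and it is not directly portable because of endianness considerations. Other than that, however, doing it is fine. The C standard has not done a good job of stating that this is fine, but apparently it is. Read these answers and sources: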

  1. Is type-punning through a union unspecified in C99, and has it become specified in C11?
  2. Unions and type-punning
  3. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type%2Dpunning - type punning is allowed in gcc C and C++

Additionally, the C18 draft N2176 ISO/IEC 9899:2017 states the following in footnote 97, in section "6.5.2.3 Structure and union members":

  1. If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called “type punning”). This might be a trap representation.

So, having

typedef union my_union_u
{
    uint32_t value;
    /// A byte array large enough to hold the largest of any value in the union.
    uint8_t bytes[sizeof(uint32_t)];
} my_union_t;

as a way to translate value into bytes is fine in C. In C++, it works as a GNU gcc extension (but is not part of the C++ standard). See @Christoph's explanation in his answer here:

GNU extensions to standard C++ (and to C90) do explicitly allow type-punning with unions. Other compilers that don't support GNU extensions may also support union type-punning, but it's not part of the base language standard.


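Download the code: you can download and run all of the code below from my eRCaGuy_hello_world repo here: "type_punning.c". The gcc build and run commands for both C and C++ are in the comments at the very top of the file.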


So, you can do something like this to read the individual bytes from a uint32_t value:

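Technique 1: union-based type punning (this IS "type punning"):

This is what "type punning" means: writing one type into a union and then reading out another type, thereby using the union to "translate" between types.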

my_union_t u;

// write to uint32_t value
u.value = 1234;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", (u.bytes)[0]);
printf("2nd byte = 0x%02X\n", (u.bytes)[1]);
printf("3rd byte = 0x%02X\n", (u.bytes)[2]);
printf("4th byte = 0x%02X\n", (u.bytes)[3]);

Sample output:

  1. On a little-endian architecture:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    
  2. On a big-endian architecture:
    1st byte = 0x00
    2nd byte = 0x00
    3rd byte = 0x04
    4th byte = 0xD2
    

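You can also use raw pointers to obtain the bytes from a variable, but this technique also has hardware architecture endianness concerns.

If you would like to use raw pointers instead, this can be done without a union, like this:

Technique 2: reading through a raw pointer (this is NOT "type punning"):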

uint32_t value = 1234;
uint8_t *bytes = (uint8_t *)&value;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output:

  1. On a little-endian architecture:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    
  2. On a big-endian architecture:
    1st byte = 0x00
    2nd byte = 0x00
    3rd byte = 0x04
    4th byte = 0xD2
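
A related option, not part of the techniques listed in this answer, is to copy the object representation out with memcpy. The sketch below assumes a hypothetical helper name, bytes_of_u32. Like Techniques 1 and 2, it yields the bytes in the host's memory order, so it has the same endianness dependence:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name is an assumption, not from the original answer).
 * Copies the in-memory bytes of `value` into `out`, lowest address first.
 * The order of the bytes therefore depends on the host's endianness. */
static inline void bytes_of_u32(uint32_t value, uint8_t out[sizeof(uint32_t)])
{
    memcpy(out, &value, sizeof value);
}
```

Calling bytes_of_u32(1234, buf) fills buf in the same order as the two sample outputs above, depending on the architecture.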
    

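You can use bitmasks and bit-shifting to avoid the hardware architecture endianness portability problem.

To avoid the endianness problem that exists with both the union type punning and raw pointer approaches above, you can use something like the following instead. This avoids the endianness differences between hardware architectures:

Technique 3.1: use bitmasks and bit-shifting (this is NOT "type punning"):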

uint32_t value = 1234;

uint8_t byte0 = (value >> 0)  & 0xff;
uint8_t byte1 = (value >> 8)  & 0xff;
uint8_t byte2 = (value >> 16) & 0xff;
uint8_t byte3 = (value >> 24) & 0xff;

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);

Sample output (the above technique is endianness-independent!):

  1. On all architectures, both big-endian and little-endian:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    

Or:

Technique 3.2: use a convenience macro to do the bitmasking and bit-shifting:

#define BYTE(value, byte_num) ((uint8_t)(((value) >> (8*(byte_num))) & 0xff))

uint32_t value = 1234;

uint8_t byte0 = BYTE(value, 0);
uint8_t byte1 = BYTE(value, 1);
uint8_t byte2 = BYTE(value, 2);
uint8_t byte3 = BYTE(value, 3);

// OR

uint8_t bytes[] = {
    BYTE(value, 0), 
    BYTE(value, 1), 
    BYTE(value, 2), 
    BYTE(value, 3), 
};

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);
printf("---------------\n");
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output (the above technique is endianness-independent!):

  1. On all architectures, both big-endian and little-endian:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    ---------------
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
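
The same shifting idea works in the other direction, which is what the question's RGBA use case needs. This is a minimal sketch, not code from my repo: the names pack_rgba and unpack_channel are hypothetical, and putting R in the least-significant byte is an arbitrary convention chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: builds a 32-bit value from four channels using
 * shifts only, so the result does not depend on host endianness.
 * Assumed layout: R in bits 0-7, G in 8-15, B in 16-23, A in 24-31. */
static inline uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)r << 0)  |
           ((uint32_t)g << 8)  |
           ((uint32_t)b << 16) |
           ((uint32_t)a << 24);
}

/* Hypothetical helper: extracts one channel (0 = R, 1 = G, 2 = B, 3 = A). */
static inline uint8_t unpack_channel(uint32_t value, unsigned channel)
{
    return (uint8_t)((value >> (8u * channel)) & 0xffu);
}
```

With this layout, pack_rgba(0x12, 0x34, 0x56, 0x78) is 0x78563412 on every architecture, whereas the union would give a different value depending on endianness.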
    

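Otherwise, (my_pixel.RGBA)[0] or (u.bytes)[0] may be equal to byte0 (as I have defined it above) if the architecture is little-endian, or equal to byte3 if the architecture is big-endian.

See the endianness diagrams here: https://en.wikipedia.org/wiki/Endianness. Note that in big-endian, the most-significant byte of any given variable is stored first in memory (meaning: at the lower address), but in little-endian, it is the least-significant byte that is stored first in memory (at the lower address). Also remember that endianness describes byte order, not bit order (bit order within a byte has nothing to do with endianness), and that each byte is 2 hex characters, or "nibbles", where a nibble is 4 bits.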
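
To see which case applies on a given machine, you can inspect the byte at the lowest address of a known value. This is a minimal sketch with a hypothetical helper name, host_is_little_endian. It only distinguishes "least-significant byte first" from everything else:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the least-significant byte of a
 * uint32_t is stored at the lowest address (little-endian), else 0. */
static inline int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    /* Inspecting an object through an unsigned char pointer is permitted. */
    return *(const unsigned char *)&probe == 1;
}
```

On a host where this returns 1, (u.bytes)[0] from Technique 1 holds 0xD2 for the value 1234. Where it returns 0 on a big-endian host, (u.bytes)[0] holds 0x00.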

According to the Wikipedia article above, network protocols usually use big-endian byte order, whereas most processors (x86, most ARM, etc.) are usually little-endian (emphasis added):

Big-endianness is the dominant ordering in networking protocols, such as in the internet protocol suite, where it is referred to as network order, transmitting the most significant byte first. Conversely, little-endianness is the dominant ordering for processor architectures (x86, most ARM implementations, base RISC-V implementations) and their associated memory.
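
Because of this mismatch, bytes destined for the wire are usually produced with shifts rather than by punning. Here is a minimal sketch, using the hypothetical helper names store_be32 and load_be32, that writes and reads a uint32_t in big-endian (network) order regardless of the host architecture:

```c
#include <stdint.h>

/* Hypothetical helper: writes `value` to `out` most-significant byte first. */
static inline void store_be32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)((value >> 24) & 0xffu);
    out[1] = (uint8_t)((value >> 16) & 0xffu);
    out[2] = (uint8_t)((value >> 8)  & 0xffu);
    out[3] = (uint8_t)((value >> 0)  & 0xffu);
}

/* Hypothetical helper: reads a big-endian uint32_t back from `in`. */
static inline uint32_t load_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
           ((uint32_t)in[3] << 0);
}
```

For the value 1234, store_be32 produces the bytes 0x00 0x00 0x04 0xD2 on any host, and load_be32 turns them back into 1234.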


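More notes on whether or not "type punning" is supported by the standard

According to Wikipedia's "Type punning" article, writing to union member value but reading from RGBA[4] is "unspecified behavior". However, Wikipedia is wrong. The other references at the top of this answer also disagree with the Wikipedia article as it is currently written.

The following statement, which I now understand and agree with, says (emphasis added):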

The quoted text, about bytes corresponding to union members other than the last one stored, does not apply to this situation. It applies to a case where, for example, a two-byte short member is written and a four-byte int member is read. The extra two bytes are unspecified. This gives a C implementation license to implement the store to the short as a two-byte store (leaving the remaining bytes of the union unchanged) or a four-byte store (perhaps because it is efficient for the processor). In the case at hand, we have a four-byte uint32_t member and a four-byte uint8_t [4] member.
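
The size argument in the quote above can be checked at compile time. This is a minimal sketch, assuming a C11 compiler (for _Static_assert) and reusing the question's union pixel:

```c
#include <stdint.h>

union pixel {
    uint32_t value;
    uint8_t RGBA[4];
};

/* Both members cover exactly the same bytes, so no byte of one member
 * lies outside the other. */
_Static_assert(sizeof(uint8_t[4]) == sizeof(uint32_t),
               "RGBA[4] and value must have the same size");

/* The union itself is expected to carry no extra padding. This is an
 * assumption being checked, not a guarantee made by the standard. */
_Static_assert(sizeof(union pixel) == sizeof(uint32_t),
               "union pixel should be exactly 4 bytes");
```

If either assertion fails on some unusual platform, the build stops instead of silently reading bytes that were not written.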

Wikipedia states (as of 22 April 2021):

Unions:

union {
    unsigned int ui;
    float d;
} my_union = { .d = x };

Accessing my_union.ui after initializing the other member, my_union.d, is still a form of type-punning [4] in C and the result is unspecified behavior [5] (and undefined behavior in C++ [6]).

From reference [5] above, "unspecified behavior" includes:

The values of bytes that correspond to union members other than the one last stored into (6.2.6.1).

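Under that reading, storing data into one member of a union but reading it from another, which is exactly what you want to use the union for, would be "unspecified behavior" according to the C standard.

I think gcc allows type punning (writing to one member of a union but reading from another member of the union, as a form of "translation") as a "gcc extension", but that the C and C++ standards otherwise prohibit it if you use -Wpedantic in your build flags.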

See also:

  1. Download and run all of the code above from my repo: https://github.com/ElectricRCAircraftGuy/eRCaGuy_hello_world/blob/master/c/type_punning.c
  2. Unions and type-punning
  3. [my repo] I added READ_BYTE() as a macro to my utilities.h file in my eRCaGuy_hello_world repo.
  4. Where do I find the current C or C++ standard documents?
  5. https://news.ycombinator.com/item?id=17263328
    1. Is type-punning through a union unspecified in C99, and has it become specified in C11? <== look here especially. Apparently, the C standard has not done a good job of making this clear.