How to do a bit representation in a C-standard way?

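According to the C standard, the value representation of integer types is implementation-defined. So 5 might not be represented as 00000000000000000000000000000101, or -1 as 11111111111111111111111111111111, as we usually assume for 32-bit two's complement. So even though the operators ~, << and >> are well defined, the bit patterns they operate on are implementation-defined. The only defined bit pattern I could find was "§5.2.1/3 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string."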

So my question is: is there an implementation-independent way of converting integer types to a bit pattern?

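We could always start with a null character and do enough bit operations on it to get it to the desired value, but I find that too cumbersome. I also realise that practically all implementations use a two's complement representation, but I want to know how to do it in a pure C-standard way. Personally, I find this topic quite intriguing because of device-driver programming, where all code written to date assumes a particular implementation.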

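If you want to get the bit pattern of a given int, then bitwise operators are your friends. If you want to convert an int to its two's complement representation, then arithmetic operators are your friends. The two representations can differ, because this is implementation-defined: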

Std Draft 2011. 6.5/4. Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) are required to have operands that have integer type. These operators yield values that depend on the internal representations of integers, and have implementation-defined and undefined aspects for signed types.

So this means that i<<1 will effectively shift the bit pattern one position to the left, but the value produced may differ from i*2 (even if i ...).

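In general, it is not that hard to accommodate unusual platforms in most cases (if you don't want to simply assume 8-bit char, two's complement, no padding, no traps, and truncating unsigned-to-signed conversion). The standard mostly gives enough guarantees, though a few macros for inspecting certain implementation details would be helpful.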

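As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever that should mean), but it is what portable code can observe. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for the bitwise operators and shifts.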
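To illustrate what "observable by portable code" means here, a minimal sketch: the helper `format_bits` below is a hypothetical name of mine, not something from the standard. It writes the value bits of an unsigned int, most significant first, using only masks and shifts on an unsigned type, so it should give the same digits on any conforming implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: writes the value bits of n into buf, most
 * significant bit first, then a terminating null character.
 * buf must hold at least sizeof(unsigned) * CHAR_BIT + 1 chars.
 * Returns the number of digits written (the width of unsigned int). */
size_t format_bits(unsigned n, char *buf)
{
    size_t len = 0;
    assert(buf != NULL);
    /* ~(UINT_MAX >> 1) has only the most significant value bit set,
     * so the loop visits every value bit and no padding bit. */
    for (unsigned mask = ~(UINT_MAX >> 1); mask != 0; mask >>= 1)
        buf[len++] = (n & mask) ? '1' : '0';
    buf[len] = '\0';
    return len;
}
```

Called with 5u, the last three digits written are 101 and everything before them is 0, whatever the machine does internally.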

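For negative values of signed types, different encodings are allowed, which leads to different (but in every case well-defined) results when reinterpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: they are equal for two's complement without padding bits, but differ for the other encodings if n is negative.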
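A small sketch of that distinction, with function names that are my own assumptions. I use memcpy for the reinterpretation instead of the pointer cast; the two functions agree for non-negative values, and for negative values only on two's complement without padding bits.

```c
#include <assert.h>
#include <string.h>

/* Value conversion: the standard defines the result as n reduced
 * modulo UINT_MAX + 1, independent of how negatives are encoded. */
unsigned convert_value(int n)
{
    return (unsigned)n;
}

/* Reinterpretation: copies the object representation of n unchanged.
 * For negative n the result depends on the encoding in use. */
unsigned reinterpret_bits(int n)
{
    unsigned u;
    /* Corresponding signed and unsigned types occupy the same storage. */
    assert(sizeof u == sizeof n);
    memcpy(&u, &n, sizeof u);
    return u;
}
```

On a two's complement machine without padding, both return UINT_MAX for -1. On sign and magnitude, `reinterpret_bits(-1)` would instead have only the sign bit and the lowest value bit set.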

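Further, padding bits may exist, and signed integer types may have more padding bits than their unsigned counterparts (but not the other way round; type punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so, for example, to get an unsigned value in which only the sign bit (of the corresponding signed type) is set, something like this must be used: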

#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )
unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
                    TYPE_PUN(unsigned, int, -1) & ~1u;

(there may be a better way) instead of

unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;

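because this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it is 0 to determine it; GCC can constant-fold that.) Padding bits can be inspected by copying the object into an unsigned char array with memcpy, though they may appear to "wobble": reading the same padding bit twice may give different results.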
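Both ideas can be sketched roughly as follows. The function names are hypothetical, and I count from UINT_MAX rather than from sign_bit to keep the example short: the width is found by shifting until nothing is left, and the object representation is copied into unsigned char storage for inspection.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Number of value bits in unsigned int: shift UINT_MAX right
 * until it becomes 0 and count the steps. */
size_t uint_width(void)
{
    size_t width = 0;
    for (unsigned v = UINT_MAX; v != 0; v >>= 1)
        width++;
    return width;
}

/* Padding bits are whatever storage is left over after the value bits. */
size_t uint_padding_bits(void)
{
    return sizeof(unsigned) * CHAR_BIT - uint_width();
}

/* Copies the object representation of v into out, which must have
 * room for sizeof(unsigned) bytes.  Any padding bits show up here,
 * but their values are unspecified. */
void object_bytes(unsigned v, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &v, sizeof v);
}
```

On common desktop platforms I would expect `uint_padding_bits()` to return 0, in which case the bytes produced by `object_bytes` contain value bits only.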

If you want the bit pattern (without padding bits) of a signed integer (little endian):

int print_bits_u(unsigned n) {
    for(; n; n>>=1) {
        putchar(n&1 ? '1' : '0'); // n&1 never traps
    }
    return 0;
}

int print_bits(int n) {
    return print_bits_u(*(unsigned *)&n & INT_MAX);
    /* This masks padding bits if int has more of them than unsigned int.
     * Note that INT_MAX is promoted to unsigned int here. */
}

int print_bits_2scomp(int n) {
    return print_bits_u(n);
}

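print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern); print_bits_2scomp gives the two's complement representation (possibly with a greater width than a signed int has, if unsigned int has fewer padding bits).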

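Care must be taken not to generate trap representations when using bitwise operators and when type punning from unsigned to signed; see below for how these can arise (for example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).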

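Unsigned-to-signed integer conversion (if the converted value is not representable in the target type) is always implementation-defined. I would expect non-two's-complement machines to be more likely to differ from the common definition, though technically it could also become an issue on two's complement implementations.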

From C11 (n1570) 6.2.6.2:

(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N-1), so that objects of that type shall be capable of representing values from 0 to 2^N - 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.

(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M≤N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:

  • the corresponding value with sign bit 0 is negated (sign and magnitude);
  • the sign bit has the value -(2^M) (two's complement);
  • the sign bit has the value -(2^M - 1) (ones' complement).

Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.
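Since the three encodings differ in the low bits of -1, a program can tell them apart at run time. The following is a sketch of that commonly used trick; the enum and function names are my own, chosen for illustration.

```c
#include <assert.h>

/* The enumerator values equal the two low bits of -1 in each encoding. */
enum sign_repr {
    SIGN_AND_MAGNITUDE = 1,  /* -1 is 1 0...01, low bits 01 */
    ONES_COMPLEMENT    = 2,  /* -1 is 1 1...10, low bits 10 */
    TWOS_COMPLEMENT    = 3   /* -1 is 1 1...11, low bits 11 */
};

/* Hypothetical helper: reports which encoding int uses by masking
 * the two lowest value bits of -1. */
enum sign_repr detect_sign_repr(void)
{
    int low = -1 & 3;
    assert(low >= 1 && low <= 3);  /* the only patterns 6.2.6.2 allows */
    return (enum sign_repr)low;
}
```

On practically every current machine this should return TWOS_COMPLEMENT. The expression -1 & 3 does not touch the sign bit of the result, so it should not be able to produce a negative zero.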

To add to mafso's excellent answer, there is a part of the ANSI C rationale that talks about this:

The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:

  • Bit-fields are specified by a number of bits, with no mention of “invalid integer” representation. The only reasonable encoding for such bit-fields is binary.
  • The integer formats for printf suggest no provision for “invalid integer” values, implying that any result of bitwise manipulation produces an integer result which can be printed by printf.
  • All methods of specifying integer constants — decimal, hex, and octal — specify an integer value. No method independent of integers is defined for specifying “bit-string constants.” Only a binary encoding provides a complete one-to-one mapping between bit strings and integer values.

The restriction to binary numeration systems rules out such curiosities as Gray code and makes possible arithmetic definitions of the bitwise operators on unsigned types.
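As a rough illustration of those "arithmetic definitions", the sketch below checks, for one unsigned value, that each bitwise operator agrees with plain arithmetic. The function name is an assumption of mine; the identities follow from the pure binary representation of unsigned types.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical check: returns 1 if the bitwise operators applied to x
 * match their arithmetic counterparts, 0 otherwise. */
int bitwise_matches_arithmetic(unsigned x)
{
    int ok = 1;
    /* UINT_MAX is 2^N - 1 for N value bits, hence always odd. */
    assert(UINT_MAX % 2u == 1u);
    ok &= (~x == UINT_MAX - x);       /* complement is a subtraction   */
    ok &= ((x >> 1) == x / 2u);       /* right shift is a division     */
    ok &= ((x << 1) == x * 2u);       /* both wrap modulo UINT_MAX + 1 */
    ok &= ((x & 1u) == x % 2u);       /* lowest bit is the remainder   */
    return ok;
}
```

This should return 1 for every unsigned argument on a conforming implementation. The same identities are not guaranteed for signed operands.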

The relevant part of the standard is probably this sentence:

3.1.2.5 Types

[...]

The type char, the signed and unsigned integer types, and the enumerated types are collectively called integral types. The representations of integral types shall define values by use of a pure binary numeration system.