Subtlety in conversion of characters to integers

Can anyone explain clearly what these lines from K&R actually mean:

"When a char is converted to an int, can it ever produce a negative integer? The answer varies from machine to machine. The definition of C guarantees that any character in the machine's standard printing character set will never be negative, but arbitrary bit patterns stored in character variables may appear to be negative on some machines, yet positive on others".

There are two more-or-less relevant parts of the standard, ISO/IEC 9899:2011.

6.2.5 Types

¶3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

¶15 The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.45)

45) CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.

This defines what you quoted from K&R. The other relevant section defines what the basic execution character set is.

5.2.1 Character sets

¶1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

¶2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

¶3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

¶4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.

¶5 The universal character name construct provides a way to name other characters.

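One consequence of these rules is that if a machine uses 8-bit characters and the EBCDIC encoding, then plain char must be an unsigned type, because the digits have the codes 240..249 in EBCDIC.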

There are a few things you need to understand first.

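  1. If I extend an 8-bit value to a 16-bit value, you would normally imagine just adding a bunch of 0s on the left. For example, if I have the 8-bit value 23, which is 00010111 in binary, then as a 16-bit number it is 0000000000010111, which is also 23.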

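  2. It turns out that negative numbers always have their high-order bit set to 1. (There may be some odd machines where this isn't true, but it holds for any machine you're likely to use.) For example, the 8-bit value -40 is represented in binary as 11011000.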

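  3. So when you convert a signed 8-bit value to a 16-bit value, if the high-order bit is 1 (that is, if the number is negative), you do not add a bunch of 0s on the left; you add a bunch of 1s instead. For example, going back to -40, we would convert 11011000 to 1111111111011000, which is the 16-bit representation of -40.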

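  4. There are also unsigned numbers, which are never negative. For example, the 8-bit unsigned number 216 is represented as 11011000. (You'll notice this is the same bit pattern as the signed number -40.) When you convert an 8-bit unsigned number to 16 bits, you add a bunch of 0s no matter what. For example, you would convert 11011000 to 0000000011011000, which is the 16-bit representation of 216.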

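  5. So, putting this all together, if you are converting an 8-bit number to 16 (or more) bits, you have to consider two things. First, is the number signed or unsigned? If it is unsigned, just add a bunch of 0s on the left. But if it is signed, you have to look at the high-order bit of the 8-bit number. If it is 0 (the number is positive), add a bunch of 0s on the left. But if it is 1 (the number is negative), add a bunch of 1s on the left. (This whole process is called sign extension.)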

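  6. Ordinary ASCII characters (like 'A' and '1' and '$') all have values less than 128, which means their high-order bit is always 0. But "special" characters from the "Latin-1" or UTF-8 character sets have byte values of 128 or more. For this reason they are sometimes also called "high bit" or "eighth bit" characters. For example, the Latin-1 character Ø (O with a slash through it) has the value 216.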

  7. Finally, although type char in C is typically an 8-bit type, the C standard does not specify whether it is signed or unsigned.

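Putting all of this together, what Kernighan and Ritchie are saying is that when we convert a char to a 16-bit or 32-bit integer, it isn't clear how step 5 applies. If I'm on a machine where type char is unsigned and I convert the character Ø to an integer, I'll probably get the value 216. But if I'm on a machine where type char is signed, I'll probably get -40.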