Contradiction in C18 standard (regarding character sets)?

In the C18 standard we read:

5.1.1.2 Translation phases

The precedence among the syntax rules of translation is specified by the following phases.

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.

Meaning that the source file character set is decoded and mapped to the source character set.

But then you can read:

5.2.1 Character sets

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).

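Meaning that the source file character set is the source character set.

So the question is: which of the two did I misunderstand, or which one is actually wrong?

EDIT: Actually I was wrong. See my answer below.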

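You have run into cross-compilation, where a program is compiled on one architecture and executed on another, and the two architectures have different character sets.

5.1.1.2 applies early, while the input is being read: the input file is converted to the compiler's single character set, which obviously has to contain every character a C program requires.

But when cross-compiling, the execution character set may be different, and 5.2.1 allows for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms this is a no-op, but on some older platforms it is not.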
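
As a rough illustration of that last step, here is a minimal sketch of what a cross compiler might do to a string constant when the target uses EBCDIC (code page 037). The names ascii_to_ebcdic and translate_constant are hypothetical and not taken from any real compiler, only letters, digits and space are covered, and the host is assumed to be ASCII-based.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from any real compiler: map one character
   of an ASCII-based source character set to its EBCDIC (code page 037)
   value. Only letters, digits, space and the null character are handled;
   everything else becomes 0x6F, the EBCDIC question mark.
   Assumption: the host character set is ASCII-based. */
unsigned char ascii_to_ebcdic(unsigned char c)
{
    if (c == '\0') return 0x00;  /* the null character is all bits zero */
    if (c >= 'A' && c <= 'I') return (unsigned char)(0xC1 + (c - 'A'));
    if (c >= 'J' && c <= 'R') return (unsigned char)(0xD1 + (c - 'J'));
    if (c >= 'S' && c <= 'Z') return (unsigned char)(0xE2 + (c - 'S'));
    if (c >= 'a' && c <= 'i') return (unsigned char)(0x81 + (c - 'a'));
    if (c >= 'j' && c <= 'r') return (unsigned char)(0x91 + (c - 'j'));
    if (c >= 's' && c <= 'z') return (unsigned char)(0xA2 + (c - 's'));
    if (c >= '0' && c <= '9') return (unsigned char)(0xF0 + (c - '0'));
    if (c == ' ') return 0x40;
    return 0x6F;
}

/* Translate the n bytes of a string constant for the target platform. */
void translate_constant(const char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = ascii_to_ebcdic((unsigned char)src[i]);
}
```

Calling translate_constant("Hi", buf, 3) would store 0xC8, 0x89, 0x00 in buf. When source and execution character sets coincide, the same step degenerates to a plain copy.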

Meaning that the source file character set is decoded and mapped to the source character set.

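No, it doesn't mean that. My take is that the source code is already assumed to be written in the source character set. What sense would it make to "map the source character set to the source character set"? Either the characters are part of the set or they aren't. If you pick the wrong encoding for your source code, it will be rejected before preprocessing even starts.

Translation phase 1 does two things that are quite unrelated to this:

  • Resolves trigraphs, which are standardized multibyte sequences.

  • Maps multibyte characters to the source character set (defined in 5.2.1).

    The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.

    Multibyte characters are defined in 5.2.1.2: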

    The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set.

    Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
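
The trigraph replacement from the first bullet can be sketched as a plain string transformation. This is only an illustration of the nine replacements the standard lists (5.2.1.1), not how any particular compiler implements them; replace_trigraphs and trigraph_replacement are made-up names. The code deliberately avoids spelling out a trigraph literally, so that it reads the same whether or not the compiler has trigraphs enabled.

```c
/* Return the replacement for the character that follows two question
   marks, or 0 if that character does not complete a trigraph. */
char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy in to out, replacing each trigraph with its single character.
   out must have room for strlen(in) + 1 bytes. */
void replace_trigraphs(const char *in, char *out)
{
    while (*in != '\0') {
        char r = 0;
        /* in[1] and in[2] are only read when the preceding byte is a
           question mark, so the terminator is never passed. */
        if (in[0] == '?' && in[1] == '?')
            r = trigraph_replacement(in[2]);
        if (r != 0) {
            *out++ = r;
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

When writing a test input, use the `\?` escape (for example "?\?=define") so the literal itself is not treated as a trigraph.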

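All of this multibyte madness goes back to the first standardization in 1990. According to anecdotes from people who were on the committee, it came about because members from various European countries could not type certain symbols on their national keyboards.

(I'm not sure how widespread the AltGr key was on such keyboards at the time. In any case, it remains a key subject to some serious button mashing when writing C on non-English keyboards, to get access to the {}[] symbols and so on.)

Well, it seems I was still wrong. After contacting David Keaton from the WG14 group (they are in charge of the C standard), I got this clarifying reply: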

There is a subtle distinction. The source character set is the character set in which source files are written. However, the source character set is just the list of characters available, which does not say anything about the encoding.

Phase 1 maps the multibyte encoding of the source character set onto the abstract source characters themselves.

In other words, a character that looks like this:

<byte 1><byte 2>

is mapped to this:

<character 1>

The first is an encoding that represents a character in the source character set in which the program was written. The second is the abstract character in the source character set.
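
To make the distinction concrete, here is a minimal sketch of such a mapping, assuming the physical source file happens to be encoded in UTF-8. The standard leaves the actual encoding implementation-defined, so UTF-8 is purely an assumption here, and decode_utf8 is a made-up name. It turns a byte sequence (the encoding) into a single code point (the abstract character).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from a UTF-8 byte sequence of at most len bytes.
   On success, stores the code point in *out and returns the number of
   bytes consumed; returns 0 on malformed or truncated input. */
size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need;
    uint32_t cp, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }  /* single-byte character */

    if ((s[0] & 0xE0) == 0xC0)      { need = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                               /* invalid lead byte */

    if (len < need) return 0;                    /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;     /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    *out = cp;
    return need;
}
```

With the two bytes 0xC3 0xA9 as input, the function consumes both and yields the single code point U+00E9, which is the <byte 1><byte 2> to <character 1> step described in the reply.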