Why are Unicode characters treated the same in C++ std::string?

Here's an Ideone: http://ideone.com/vjByty

#include <iostream>
#include <string>

using namespace std;

int main() {
    string s = "\u0001\u0001";
    cout << s.length() << endl;
    if (s[0] == s[1]) {
        cout << "equal\n";
    }
    return 0;
}

I'm confused on so many levels.

What does it mean when I type in an escaped Unicode string literal in my C++ program?

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

Why are the first two characters of s (first two bytes) equal?

So the draft C++11 standard has the following to say about universal-character-names in narrow string literals (emphasis mine going forward):

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote [...] In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding

and includes the following note:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
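That size arithmetic is easy to check with sizeof, since a string literal's array type includes the terminating '\0' (a minimal sketch, assuming an ASCII-compatible execution character set where each \u0001 maps to a single char):

#include <iostream>

int main() {
    // Two universal-character-names mapping to one char each, plus the
    // terminating '\0', give an array of 3 chars.
    std::cout << sizeof("\u0001\u0001") << std::endl;  // prints 3
    return 0;
}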

Section 2.14.3, mentioned above, says:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

If I try this example (see it live):

string s = "\u0F01\u0001";

the first universal-character-name maps to more than one char element.
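A self-contained version of that example (assuming a UTF-8 execution character set, the default for GCC and Clang, in which U+0F01 encodes to three bytes):

#include <iostream>
#include <string>

int main() {
    // \u0F01 encodes to 0xE0 0xBC 0x81 in UTF-8 (three char elements),
    // while \u0001 occupies a single char, so length() reports 4.
    string s = "\u0F01\u0001";
    std::cout << s.length() << std::endl;
    return 0;
}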

What does it mean when I type in an escaped Unicode string literal in my C++ program?

Quoting the standard:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

Commonly, the execution character set will be ASCII, which contains a character with the value 1. So \u0001 will be translated to a single char with the value 1.

If you were to specify a non-ASCII character, such as \u263A, you would likely see more than one byte per character.
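For instance (a sketch assuming a UTF-8 execution character set; the exact byte count is implementation-dependent):

#include <iostream>
#include <string>

int main() {
    // U+263A (WHITE SMILING FACE) lies outside ASCII; in UTF-8 it
    // occupies three char elements.
    std::string smiley = "\u263A";
    std::cout << smiley.length() << std::endl;  // prints 3 under UTF-8
    return 0;
}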

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

If it were UTF-16, yes. But a string cannot be encoded in UTF-16 unless char is 16 bits wide, which it usually isn't. UTF-8 is a far more likely encoding, in which characters with values up to 127 (i.e. the entire ASCII set) are encoded in a single byte.
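To actually get UTF-16 code units you need char16_t elements, which C++11 provides via std::u16string and u"" literals (a minimal sketch; sizeof(char16_t) is 2 on typical platforms):

#include <iostream>
#include <string>

int main() {
    // Each BMP code point occupies one 16-bit code unit, so the two
    // characters take 2 code units, i.e. 4 bytes on typical platforms.
    std::u16string s = u"\u0001\u0001";
    std::cout << s.length() << std::endl;                     // prints 2
    std::cout << s.length() * sizeof(char16_t) << std::endl;  // prints 4
    return 0;
}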

Why are the first two characters of s (first two bytes) equal?

Per the assumptions above, they are both chars with the value 1.
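A quick way to confirm that (assuming the single-byte ASCII mapping described above):

#include <iostream>
#include <string>

int main() {
    std::string s = "\u0001\u0001";
    // Both elements are plain chars holding the value 1, so the
    // comparison in the original program is simply 1 == 1.
    std::cout << static_cast<int>(s[0]) << ' '
              << static_cast<int>(s[1]) << std::endl;  // prints: 1 1
    return 0;
}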