Why are unicode characters treated the same in C++ std::string?
Here is an Ideone: http://ideone.com/vjByty.
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0001\u0001";
    cout << s.length() << endl;
    if (s[0] == s[1]) {
        cout << "equal\n";
    }
    return 0;
}
I'm confused on many levels.
What does it mean when I type an escaped Unicode string literal in my C++ program?
Shouldn't 2 characters take 4 bytes? (assuming UTF-16)
Why are the first two characters of s (the first two bytes) equal?
So the C++11 draft standard has the following to say about universal characters in narrow string literals (emphasis mine going forward):
Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote [...] In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding
and includes the following note:
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
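To see that note in action, here is a minimal sketch (assuming a UTF-8 or other ASCII-compatible execution character set, as GCC on Ideone uses); sizeof counts every char element of the literal, including the terminator:

#include <iostream>

int main() {
    // \u0001 encodes to a single byte under an ASCII-compatible
    // execution character set, so the literal holds two chars plus '\0'.
    std::cout << sizeof("\u0001\u0001") << '\n';  // prints 3
    return 0;
}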
Section 2.14.3, referred to above, says:
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
If I try this example (see it live):
string s = "\u0F01\u0001";
the first universal character maps to more than one char element, as the sketch below demonstrates.
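A minimal sketch of that mapping, assuming a UTF-8 execution character set: U+0F01 encodes to three bytes and U+0001 to one, so the string holds four char elements.

#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0F01\u0001";
    // Under UTF-8, \u0F01 occupies three chars and \u0001 one.
    cout << s.length() << endl;  // prints 4
    return 0;
}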
What does it mean when I type in an escaped Unicode string literal in my C++ program?
Quoting the standard:
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
Typically the execution character set will be ASCII, which contains a character with the value 1, so \u0001 will be translated to a single char with the value 1.
If you were to specify a non-ASCII character such as \u263A, you would likely see more than one byte per character.
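For instance, here is a hedged sketch that dumps the bytes of \u263A; assuming a UTF-8 execution character set, U+263A encodes as the three bytes E2 98 BA, though another encoding could differ:

#include <cstdio>
#include <string>
using namespace std;

int main() {
    string smiley = "\u263A";
    // Print each char element as an unsigned byte in hex.
    for (unsigned char byte : smiley)
        printf("%02X ", byte);
    printf("\n");  // likely prints: E2 98 BA
    return 0;
}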
Shouldn't it take 4 bytes for 2 characters? (assuming utf-16)
If it were UTF-16, yes. But a string cannot be encoded as UTF-16 unless char is 16 bits wide, which it usually is not. UTF-8 is a far more likely encoding, in which characters with values up to 127 (i.e. the entire ASCII set) are encoded as a single byte.
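If you actually want UTF-16, C++11's char16_t string literals provide it; a minimal sketch showing the two-characters-four-bytes arithmetic the question expected (assuming char16_t is 2 bytes, as it is on typical platforms):

#include <iostream>
#include <string>
using namespace std;

int main() {
    // u"..." literals are UTF-16 encoded: each BMP character
    // occupies exactly one 16-bit code unit.
    u16string t = u"\u0001\u0001";
    cout << t.length() << endl;                     // prints 2 (code units)
    cout << t.length() * sizeof(char16_t) << endl;  // prints 4 (bytes)
    return 0;
}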
Why are the first two characters of s (first two bytes) equal?
Per the assumptions above, they are both characters with the value 1.
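A quick sketch confirming that, by printing the numeric value of each element:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0001\u0001";
    // Both elements hold the value 1, so s[0] == s[1] holds.
    cout << static_cast<int>(s[0]) << ' '
         << static_cast<int>(s[1]) << endl;  // prints: 1 1
    return 0;
}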