Why are unicode characters treated the same in C++ std::string?
Here is an Ideone: http://ideone.com/vjByty.
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0001\u0001";
    cout << s.length() << endl;
    if (s[0] == s[1]) {
        cout << "equal\n";
    }
    return 0;
}
I'm confused on many levels.
What does it mean when I type an escaped Unicode string literal in my C++ program?
Shouldn't 2 characters take 4 bytes? (assuming UTF-16)
Why are the first two characters of s (the first two bytes) equal?
So the C++11 draft standard has the following to say about universal characters in narrow string literals (emphasis mine going forward):
Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote [...] In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding
and includes the following note:
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
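To see that note in action, here is a minimal sketch (assuming a UTF-8 or other ASCII-compatible execution character set, as GCC on Ideone uses); sizeof counts every char element of the literal, including the terminator:

#include <iostream>

int main() {
    // \u0001 encodes to a single byte under an ASCII-compatible
    // execution character set, so the literal holds two chars plus '\0'.
    std::cout << sizeof("\u0001\u0001") << '\n';  // prints 3
    return 0;
}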
Section 2.14.3, referred to above, says:
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
If I try this example (see it live):
string s = "\u0F01\u0001";
the first universal character maps to more than one char element, as the sketch below demonstrates.
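A minimal sketch of that mapping, assuming a UTF-8 execution character set: U+0F01 encodes to three bytes and U+0001 to one, so the string holds four char elements.

#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0F01\u0001";
    // Under UTF-8, \u0F01 occupies three chars and \u0001 one.
    cout << s.length() << endl;  // prints 4
    return 0;
}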
What does it mean when I type in an escaped Unicode string literal in my C++ program?
Quoting the standard:
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
Typically the execution character set will be ASCII, which contains a character with the value 1, so \u0001 will be translated to a single char with the value 1.
If you were to specify a non-ASCII character such as \u263A, you would likely see more than one byte per character.
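For instance, here is a hedged sketch that dumps the bytes of \u263A; assuming a UTF-8 execution character set, U+263A encodes as the three bytes E2 98 BA, though another encoding could differ:

#include <cstdio>
#include <string>
using namespace std;

int main() {
    string smiley = "\u263A";
    // Print each char element as an unsigned byte in hex.
    for (unsigned char byte : smiley)
        printf("%02X ", byte);
    printf("\n");  // likely prints: E2 98 BA
    return 0;
}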
Shouldn't it take 4 bytes for 2 characters? (assuming utf-16)
If it were UTF-16, yes. But a string cannot be encoded as UTF-16 unless char is 16 bits wide, which it usually is not. UTF-8 is a far more likely encoding, in which characters with values up to 127 (i.e. the entire ASCII set) are encoded as a single byte.
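If you actually want UTF-16, C++11's char16_t string literals provide it; a minimal sketch showing the two-characters-four-bytes arithmetic the question expected (assuming char16_t is 2 bytes, as it is on typical platforms):

#include <iostream>
#include <string>
using namespace std;

int main() {
    // u"..." literals are UTF-16 encoded: each BMP character
    // occupies exactly one 16-bit code unit.
    u16string t = u"\u0001\u0001";
    cout << t.length() << endl;                     // prints 2 (code units)
    cout << t.length() * sizeof(char16_t) << endl;  // prints 4 (bytes)
    return 0;
}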
Why are the first two characters of s (first two bytes) equal?
Per the assumptions above, they are both characters with the value 1.
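A quick sketch confirming that, by printing the numeric value of each element:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0001\u0001";
    // Both elements hold the value 1, so s[0] == s[1] holds.
    cout << static_cast<int>(s[0]) << ' '
         << static_cast<int>(s[1]) << endl;  // prints: 1 1
    return 0;
}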