Why does '€' == '\€' but "€" != "\€" and u8"€" != u8"\€"

Question

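After reading up on UTF-8 everywhere, I tried to change some of my code to use std::string. I assumed if I set a std::string to u8"€" (that's the euro symbol AltGr+4 on my keyboard) the std::string would have 3 bytes containing the unicode code (\U20AC) for the euro symbol. It doesn't. Consider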

std::string x[] = {"€", u8"€", u8"\€", "\u20AC", u8"\u20AC"};

size_t size[] = {x[0].size(), x[1].size(), x[2].size(), x[3].size(), x[4].size()};

If I look at the results in the debugger's local variables view, I see

x[] = {"€", "€", "â??", "â‚¬", "â‚¬"}

and

size[] = {1, 1, 3, 3, 3}

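As far as I can tell, the last two are the only ones that give me the expected result. I'm obviously missing something to do with string literals, but I'm also puzzled how the debugger shows the correct string for the first two given it thinks they're one char long and (int64_t(x[0].c_str()[0]) == int64_t(x[1].c_str()[0]) == -128.

Also why does '€' == '\€' but "€" != "\€" and u8"€" != u8"\€". (Edit: ignore this. Remy pointed out my mistake below: I was comparing char pointers.)

The results also beg the question what is the purpose of the u8 string literal prefix?

Can someone explain this before I go back to wchar_t?

I'm using RAD Studio 10.2 on Windows 10.

Edit: I tried this with various non-ASCII Unicode characters using the Character Map tool. I couldn't get it to work with any of them. size() was always 1, and the debugger showed a different character from the one I had used (usually "?"). I'm using a Surface Pro Type Cover and, as far as I can tell, there is no way to type arbitrary Unicode characters on the keyboard (other than €). It's strictly backslash codes for me from now on. I'm glad I've cleared this up, even if I wasted an entire day on it. Thanks, everyone.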

Answer 1

I assumed if I set a std::string to u8"€" (that's the euro symbol AltGr+4 on my keyboard) the std::string would have 3 bytes containing the unicode code (\U20AC) for the euro symbol. It doesn't.

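It should, yes. The u8 prefix guarantees that the literal is stored as UTF-8 in the final executable, and U+20AC does indeed encode to 3 bytes in UTF-8. If you are seeing something different, that is likely a compiler bug and should be reported to Embarcadero.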

I'm also puzzled how the debugger shows the correct string for the first two given it thinks they're one char long and (int64_t(x[0].c_str()[0]) == int64_t(x[1].c_str()[0]) == -128.

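The second one should be 3 bytes, not 1.

Since it is only 1 byte, it displays correctly purely by coincidence. The first string literal has no prefix, so it is interpreted using the compiler's default ANSI charset, which in your case must happen to have the euro symbol at byte 0x80.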

Also why does '€' == '\€' but "€" != "\€" and u8"€" != u8"\€".

Because the first comparison compares actual char values, whereas the others compare raw char* pointers rather than the characters they point to.

The results also beg the question what is the purpose of the u8 string literal prefix?

Exactly what you expect: it should make the compiler output the contents of the string literal in UTF-8 encoding.

c++

string-literals