What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Question

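Short version:
If I want to write programs that can efficiently perform operations on Unicode characters, and that can read and write files encoded in UTF-8 or UTF-16, what is the appropriate way to do this in C++?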

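Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant, leak-free C++ code. I need clear answers to the following: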

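Which string container should I pick?
- std::string with UTF-8?
- std::wstring (don't really know much about it)
- std::u16string with UTF-16?
- std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-English characters in string literals when using UTF strings, such as the Polish characters ąćęłńśźż?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?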
```
 std::string s = u8"foo";
 s += 'x';
```
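What are the differences between wchar_t and other multi-byte character types? Is a wchar_t character or a wchar_t string literal capable of storing UTF encodings?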

Answer 1

Which string container should I pick?

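That is entirely up to you, based on your particular needs. Any of the choices you listed will work, and each has its own advantages and disadvantages. Generally, UTF-8 is good for storage and communication, and it is backwards compatible with ASCII, whereas UTF-16/32 is easier to work with when processing Unicode data.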

std::wstring (don't really know much about it)

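The size of wchar_t is compiler-dependent and even platform-dependent. For example, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes, making std::wstring usable for UTF-32 encoded strings. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues with UTF-8, since char can be either signed or unsigned at the compiler vendor's discretion, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.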
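To see what a particular toolchain does, you can query the types directly. This is only a small sketch using standard facilities; the helper names `wchar_bits` and `plain_char_is_signed` are made up for illustration.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// Width of wchar_t in bits on the current platform
// (commonly 16 on Windows and 32 on Linux and macOS).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Whether plain char is a signed type with this compiler and its flags.
constexpr bool plain_char_is_signed() { return std::is_signed<char>::value; }

// char16_t and char32_t are guaranteed to be at least this wide everywhere.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t too narrow");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t too narrow");
```

Code that branches on `wchar_bits()` is exactly the kind of platform dependence that char16_t and char32_t let you avoid.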

Should I stick entirely to one of the above containers or change them when needed?

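Use whichever container suits your needs.

Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters or leaves your program, such as when reading/writing files, communicating over a network, or making platform system calls.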

How to properly convert between them?

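There are many ways to handle that.

C++11 and later have std::wstring_convert/std::wbuffer_convert, but these were deprecated in C++17.

There are third-party Unicode conversion libraries, such as iconv and ICU.

There are C library functions, platform system calls, and so on.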
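If you would rather not depend on a deprecated facility or a third-party library for the simple cases, the UTF-8/UTF-32 transformation is small enough to write by hand. The following is a minimal sketch, not a replacement for ICU; the function names `to_utf8` and `to_utf32` are my own, and the choice to throw std::invalid_argument on bad input is an assumption (substituting U+FFFD is another common policy).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode UTF-32 code points as UTF-8 bytes stored in a std::string.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 into UTF-32. Throws on truncated or malformed sequences.
std::u32string to_utf32(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i < len) throw std::invalid_argument("truncated UTF-8");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        out += cp;
        i += len;
    }
    return out;
}
```

In line with the advice above, you would call something like `to_utf32` once when data enters the program and `to_utf8` once when it leaves, rather than converting back and forth in the middle of your logic.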

Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?

Yes, if you use the appropriate string literal prefix:

u8 for UTF-8.

L for UTF-16 or UTF-32 (depending on compiler/platform).

u for UTF-16.

U for UTF-32.

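Also, be aware that the charset you use to save your source files affects how the compiler interprets string literals. So whatever charset you choose for saving your files (such as UTF-8), make sure you tell the compiler what that charset is, or else you may end up with the wrong string values at runtime.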
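As an illustration of the prefixes, here is a compile-time sketch. It spells the characters with universal character names (`\u0105` is ą) so that the result does not depend on the source file's charset; the helper `code_units` is a hypothetical name of mine, not a standard function.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+0105 (ą) lies inside the BMP; U+1F600 lies outside it.
static_assert(code_units(u8"\u0105") == 2, "UTF-8: two bytes");
static_assert(code_units(u"\u0105") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u0105") == 1, "UTF-32: one code unit");
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four bytes");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");

// With the L prefix the answer depends on the platform's wchar_t:
// 2 code units where wchar_t is 16 bits, 1 where it is 32 bits.
static_assert(code_units(L"\U0001F600") == (sizeof(wchar_t) == 2 ? 2 : 1),
              "wide literal follows the platform encoding");
```

Writing ą directly in the literal gives the same result as long as the compiler has been told the correct source charset.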

What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?

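Each char in the string can be a single-byte character, or be part of a multi-byte representation of a Unicode code point. It depends on the encoding of the string and on the character being encoded.

The same is true of std::wstring (when wchar_t is 2 bytes) and std::u16string: they can hold strings containing supplementary characters outside the Unicode BMP, which require UTF-16 surrogates to encode.

When a string container holds a UTF-encoded string, each "character" is just a UTF code unit. UTF-8 encodes a Unicode code point as 1-4 code units (1-4 chars in a std::string). UTF-16 encodes a code point as 1-2 code units (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a code point as 1 code unit (1 char32_t in a std::u32string).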
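One practical consequence is that size() reports code units, not code points. A minimal sketch of counting code points, assuming the input is already well-formed (the overloaded name `count_code_points` is hypothetical):

```cpp
#include <cstddef>
#include <string>

// UTF-8: every code point has exactly one byte that is not a
// continuation byte (10xxxxxx), so count those.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// UTF-16: count every code unit except low (trailing) surrogates,
// so that a surrogate pair is counted once.
std::size_t count_code_points(const std::u16string& utf16) {
    std::size_t n = 0;
    for (char16_t c : utf16)
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    return n;
}
```

For the three code points a, U+0105 and U+1F600, the UTF-8 string has 7 code units and the UTF-16 string has 4, yet both functions return 3.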

What happens when i do the following?
std::string s = u8"foo";
s += 'x';

Exactly what you would expect. A std::string holds char elements. Regardless of the encoding, operator+=(char) simply appends a single char to the end of the std::string.
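To make that concrete, here is a small sketch (the function name `build_example` is made up). It uses a plain "foo" literal rather than u8"foo" so that it also compiles as C++20, where u8 literals have type const char8_t[] and no longer initialize a std::string; for pure ASCII text the bytes are the same.

```cpp
#include <string>

// Builds "foox" followed by U+0105 (ą) as UTF-8 bytes.
std::string build_example() {
    std::string s = "foo";  // ASCII text is already valid UTF-8
    s += 'x';               // appends one char: one code unit, one code point
    s += "\xC4\x85";        // U+0105 needs two code units, so append a string
    return s;
}
```

Appending an ASCII char keeps the string valid UTF-8. A non-ASCII character does not fit in a single char, so it has to be appended as its full multi-byte sequence.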

How can I distinguish UTF char[] and non-UTF char[] or std::string?

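You need outside knowledge of the string's original encoding, or else you have to perform your own heuristic analysis of the char[]/std::string data to see whether it conforms to a UTF encoding.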
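One such heuristic is to check whether the bytes form well-formed UTF-8. The sketch below is my own illustration (the name `looks_like_utf8` is hypothetical). Keep in mind that passing the check is evidence rather than proof: pure ASCII always passes, and text in another encoding can occasionally pass by accident.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: no stray or missing
// continuation bytes, no overlong forms, no surrogates, and nothing
// above U+10FFFF.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }  // single-byte ASCII
        std::size_t len;
        char32_t cp;
        char32_t min;  // smallest code point allowed for this length
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // continuation byte or invalid lead byte
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}
```

For example, the single byte 0xB1 (± in Latin-1) is rejected, while the two bytes 0xC4 0x85 (ą in UTF-8) are accepted.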

What are differences between wchar_t and other multi-byte character types?

Byte size and UTF encoding.

char = ANSI/MBCS or UTF-8

wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform

char8_t = UTF-8

char16_t = UTF-16

char32_t = UTF-32

Is wchar_t character or wchar_t string literal capable of storing UTF encodings?

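Yes, UTF-16 or UTF-32, depending on compiler/platform. In the case of UTF-16, a single wchar_t can only hold a code point value that is within the BMP. A single wchar_t in UTF-32 can hold any code point value. A wchar_t string can encode all code points in either encoding.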
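The BMP limit for a single 16-bit code unit comes from how UTF-16 represents larger values as surrogate pairs. A minimal sketch of that encoding step, using char16_t so that it behaves the same on every platform (the name `encode_utf16` is hypothetical):

```cpp
#include <stdexcept>
#include <string>

// Encode one code point as UTF-16: a single code unit inside the BMP,
// or a high/low surrogate pair for U+10000 and above.
std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low 10 bits
    }
    return out;
}
```

U+0105 comes out as the single unit 0x0105, whereas U+1F600 comes out as the pair 0xD83D 0xDE00, which is why it cannot be stored in one 16-bit wchar_t.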

How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?

That is a very broad topic, worthy of its own separate question.
