Loop through Unicode string as character
For the following string, the size is output incorrectly. Why is that, and how can I fix it?
string str = " ██████";
cout << str.size();
// outputs 19 rather than 7
I am trying to loop through str character by character so that I can read it into a vector<string>, which should have a size of 7, but I can't do that because the above code outputs 19.
TL;DR
The size() and length() members of basic_string return the number of underlying units in the string, not the number of visible characters. To get the expected count:
- Use UTF-16 with the u prefix for very simple strings that contain no characters outside the BMP, no combining characters, and no joining characters
- Use UTF-32 with the U prefix for very simple strings that contain no combining or joining characters
- Normalize the string and count, for arbitrary Unicode strings
" ██████"
是一个 space 后跟一系列 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string
. UTF-8 is a variable-length encoding 并且许多字母使用多个字节编码(因为很明显你不能只用编码超过 256 个字符一个字节)。在 UTF-8 中,U+0800 和 U+FFFF 之间的代码点由 3 个字节编码。因此 UTF-8 中字符串的长度是 1 + 6*3 = 19 字节。
You can check with any Unicode converter, such as this one, and see that the string is encoded in UTF-8 as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88. You can also iterate over each byte of the string to check.
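For example, here is a minimal sketch of that check (assuming the source file is saved as UTF-8, so std::string holds the UTF-8 bytes as above) that dumps every byte in hex:
#include <cstdio>
#include <string>

int main() {
    std::string str = " ██████";
    // Print each byte of the underlying UTF-8 encoding in hex.
    for (unsigned char byte : str)
        std::printf("%02X ", byte);
    std::printf("\n"); // 20 E2 96 88 ... 19 bytes in total
}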
If you want the total number of visible characters in the string, it gets trickier, and churill's solution doesn't work. Read this example from Twitter:
If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:
café 0x63 0x61 0x66 0xC3 0xA9 Using the “é” character, called the “composed character”.
café 0x63 0x61 0x66 0x65 0xCC 0x81 Using the combining diacritical, which overlaps the “e”
You need a Unicode library like ICU to normalize the string and count. Twitter, for example, uses Normalization Form C.
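As a hedged sketch of that approach (assuming ICU is installed and linked, typically with -licuuc), you could normalize to NFC with ICU's Normalizer2 and then count code points; the decomposed spelling of "café" then counts as 4:
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;
    // "café" spelled with the combining acute: 0x63 0x61 0x66 0x65 0xCC 0x81
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
    icu::UnicodeString normalized = nfc->normalize(s, status);
    if (U_FAILURE(status)) return 1;
    std::cout << normalized.countChar32() << '\n'; // 4 code points after NFC
}
Note that this still counts code points after normalization, which matches Twitter's rule above but is not the same as counting grapheme clusters in the general case.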
Edit:
Since you are only interested in box-drawing characters, which don't appear to be outside the BMP and don't include any combining characters, both UTF-16 and UTF-32 will work. Like std::string, std::wstring is a basic_string and has no mandated encoding. On most implementations it is usually UTF-16 (Windows) or UTF-32 (*nix), so you could use it, but it's unreliable and depends on the source code encoding. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>). They will work regardless of the system and the source file's encoding:
std::wstring wstr = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size(); // may work, returns the number of wchar_t characters
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units
If you're interested in how to count all Unicode characters, read on below.
The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.
[...]
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes
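If code points (rather than grapheme clusters) are what you need, here is a minimal sketch for a UTF-8 std::string that counts every byte which is not a continuation byte (10xxxxxx); note that, per the quote above, the decomposed "café" would still count as 5 code points unless you normalize it first:
#include <iostream>
#include <string>

// Count code points in a UTF-8 encoded string by skipping continuation bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) // not a 10xxxxxx continuation byte
            ++count;
    return count;
}

int main() {
    std::string str = " ██████";
    std::cout << str.size() << '\n';             // 19 bytes
    std::cout << count_code_points(str) << '\n'; // 7 code points
}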
See also
- When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
- Why is the length of this string longer than the number of characters in it?
std::string only holds characters that are 1 byte long (usually 8 bits, holding UTF-8 code units); you need wchar_t and std::wstring to achieve what you want:
std::wstring str = L" ██████";
std::cout << str.size();
This prints 7, though (one space and 6 Unicode characters). Note the L in front of the string literal, so it is interpreted as a wide string.
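As a sketch of the original goal of reading the characters into a vector, and assuming (as in this answer) that each visible character is a single wchar_t code unit, which holds for U+2588 because it lies in the BMP and there are no combining characters, you could split the wide string like this:
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::wstring str = L" ██████";
    std::vector<std::wstring> chars;
    // One entry per wchar_t; valid here because U+2588 is in the BMP
    // and the string contains no combining characters.
    for (wchar_t c : str)
        chars.emplace_back(1, c);
    std::cout << chars.size() << '\n'; // 7
}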