How to measure the correct size of non-ASCII characters?

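In the following program, I try to measure the length of a string that contains non-ASCII characters.

However, I am not sure why size() does not print the correct length when the string contains non-ASCII characters.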

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox

std::string::size returns the length in bytes, not in number of characters. Your second string uses a Unicode encoding (UTF-8 here, judging by the output), so a character may take several bytes. Note that the same applies to std::wstring::size, since the result depends on the encoding: it returns the number of wide characters (code units), not actual characters. With UTF-32 that matches the number of code points, but with UTF-16 it does not necessarily match, because characters outside the Basic Multilingual Plane take two code units (more in this answer).

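To measure the actual length (in number of symbols), you need to know the encoding in order to separate (and therefore count) the characters correctly. For example, this may be helpful for UTF-8 (although the method it uses is deprecated in C++17).

Another option for UTF-8 is to count the lead bytes, that is, every byte that is not a continuation byte (credit to this other answer):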

int utf8_length(const std::string& s) {
  int len = 0;
  for (auto c : s)
      // Continuation bytes have the bit pattern 10xxxxxx; count all other bytes.
      len += (c & 0xc0) != 0x80;
  return len;
}

I used the std::wstring_convert class and got the correct string length.

#include <string>
#include <iostream>
#include <codecvt>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn;
    auto sz = cn.from_bytes(s2).size();
    std::cout << "Size of " << s2 << " is " << sz << std::endl;
}

Live demo on Wandbox.

For more about std::wstring_convert, see the reference link here.