How can I get the UTF-8 int value of a Unicode character from a (w)string?

Question

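Situation

I need a function that takes a string, encodes every non-ASCII character as UTF-8, and replaces that character with the resulting hexadecimal number.

For example, the ӷ in a word like "djvӷdio" should be replaced with "d3b7", while the rest stays unchanged.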

Explanation:
ӷ equals int 54199, which is d3b7 in hexadecimal
djvӷdio --> djvd3b7dio
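
The arithmetic in this explanation can be checked with a short sketch. It assumes the character is already available as UTF-8 bytes in a std::string; the helper name `utf8_bytes_as_int` is made up for illustration and is not part of any library.

```cpp
#include <cstdint>
#include <string>

// Packs the bytes of one UTF-8 encoded character (1 to 4 bytes) into a
// single integer, most significant byte first.
// Hypothetical helper, named here only for illustration.
std::uint32_t utf8_bytes_as_int(const std::string &encoded_char)
{
    std::uint32_t value = 0;
    for (unsigned char byte : encoded_char)
        value = (value << 8) | byte;
    return value;
}
```

For ӷ the two bytes are 0xD3 and 0xB7, so `utf8_bytes_as_int("\xD3\xB7")` yields 0xD3 * 256 + 0xB7 = 54199.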

I already have a function that returns the hexadecimal value of an int.

My machine

Kubuntu 19.10
Compiler: g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

My ideas

Idea 1

std::string encode_utf8(const std::string &str);

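With the function above, I iterate over the whole string, which contains Unicode, and if the current character is non-ASCII I replace it with its hexadecimal value.

Problem:

Iterating over a string that contains Unicode is not a good idea, because unlike a plain char, a Unicode character can consist of up to 4 bytes. One Unicode character is therefore treated as several chars, each of which prints as garbage. In short, the string cannot be indexed by character.

Idea 2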

std::string encode_utf8(const std::wstring &wstr);

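Again, I iterate over the whole string with Unicode characters, and if the current character is non-ASCII I replace it with its hexadecimal value.

Problem:

Indexing works now, but it returns a wchar_t holding the corresponding UTF-32 number, and I absolutely need the UTF-8 number.

How can I get a character from the string and obtain its UTF-8 number from it?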

Answer 1

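Your input string is UTF-8 encoded, which means each character is encoded by anywhere from 1 to 4 bytes. You can't just scan through the string and convert the bytes one by one unless your loop understands how Unicode characters are encoded in UTF-8.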
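
As a rough illustration of what "understanding the encoding" involves: the lead byte of each UTF-8 sequence announces how many bytes belong to the character. A minimal sketch, with a hypothetical function name and no validation of the continuation bytes that follow:

```cpp
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` cannot start a sequence (continuation byte or invalid).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or 11111xxx
}
```

For the ӷ from the question, the lead byte 0xD3 gives a length of 2, while its second byte 0xB7 is a continuation byte and gives 0.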

You need a UTF-8 decoder.

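Fortunately, if all you need is decoding, there are very lightweight ones available. UTF8-CPP is pretty much a single header and has a function that gives you individual Unicode characters: utf8::next will feed you a uint32_t (the code point of the "largest" character fits into an object of this type). Now you can simply check whether the value is less than 128: if it is, cast it to char and append it; if it isn't, serialize the integer in whatever way you see fit.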
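
The decode-then-branch loop described here might look roughly like the sketch below. To keep it self-contained it does not use UTF8-CPP; `next_code_point` is a hand-rolled, non-validating stand-in for a library decoder such as utf8::next, and both function names are hypothetical. It assumes the input is valid UTF-8, and for non-ASCII characters it writes the hex digits of the original UTF-8 bytes, as the question asks.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hand-rolled stand-in for a library decoder such as utf8::next.
// Reads one code point starting at `pos` and advances `pos` past it.
// Assumes `text` is valid UTF-8; performs no validation.
std::uint32_t next_code_point(const std::string &text, std::size_t &pos)
{
    unsigned char lead = static_cast<unsigned char>(text[pos++]);
    if (lead < 0x80)
        return lead;

    // Number of continuation bytes, then the payload bits of the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    std::uint32_t code_point = lead & (0x3F >> extra);

    while (extra-- > 0 && pos < text.size())
        code_point = (code_point << 6)
                   | (static_cast<unsigned char>(text[pos++]) & 0x3F);
    return code_point;
}

// Copies ASCII characters unchanged and replaces every other character
// with the lowercase hex digits of its UTF-8 bytes.
std::string hex_encode_non_ascii(const std::string &text)
{
    static const char digits[] = "0123456789abcdef";
    std::string result;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::size_t start = pos;
        std::uint32_t code_point = next_code_point(text, pos);
        if (code_point < 128)
        {
            result += static_cast<char>(code_point);
            continue;
        }
        // Non-ASCII: emit the source bytes [start, pos) as hex.
        for (std::size_t i = start; i < pos; ++i)
        {
            unsigned char byte = static_cast<unsigned char>(text[i]);
            result += digits[byte >> 4];
            result += digits[byte & 0x0F];
        }
    }
    return result;
}
```

Calling `hex_encode_non_ascii(std::string("djv\xD3\xB7") + "dio")` returns "djvd3b7dio". The literal is split in two because a hex escape such as `\xB7` would otherwise swallow the following `d` as another hex digit.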

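I implore you to consider, though, whether this is really what you want to do. Your output will be ambiguous: there is no way to tell whether a run of digits in it is actual digits or the representation of some non-ASCII character. Why not stick with the original UTF-8 encoding, or use something like HTML entity encoding or quoted-printable? These encodings are widely understood and widely supported.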
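
To make the ambiguity concrete: the plain ASCII text "djvd3b7dio" and the encoded form of "djvӷdio" are the same string, so the transformation cannot be reversed. A delimited form avoids that. The sketch below emits HTML numeric character references from a std::u32string of code points; the function name is hypothetical, and taking a std::u32string as input is an assumption made only to keep the example short.

```cpp
#include <string>

// Replaces every non-ASCII code point with an HTML numeric character
// reference such as "&#x4f7;". '&' is escaped as well, so the output
// can be decoded unambiguously.
std::string to_html_entities(const std::u32string &code_points)
{
    static const char digits[] = "0123456789abcdef";
    std::string result;
    for (char32_t cp : code_points)
    {
        if (cp == U'&')
            result += "&amp;";
        else if (cp < 128)
            result += static_cast<char>(cp);
        else
        {
            // Build the hex digits of the code point, least significant first.
            std::string hex;
            for (char32_t rest = cp; rest != 0; rest >>= 4)
                hex.insert(hex.begin(), digits[rest & 0xF]);
            result += "&#x" + hex + ";";
        }
    }
    return result;
}
```

With this, `to_html_entities(U"djv\u04F7dio")` gives "djv&#x4f7;dio", which a reader can tell apart from the literal text "djv4f7dio".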

Answer 2

I just solved the problem:

std::string Tools::encode_utf8(const std::wstring &wstr)
{
    std::string utf8_encoded;

    //iterate through the whole string
    for(size_t j = 0; j < wstr.size(); ++j)
    {
        if(wstr.at(j) <= 0x7F)
            utf8_encoded += wstr.at(j);
        else if(wstr.at(j) <= 0x7FF)
        {
            //template for a 2-byte sequence: 110xxxxx 10xxxxxx
            int utf8 = 0b11000000'10000000;

            //the lowest 6 bits go into the last byte
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the 5 remaining bits
             * put them 2 to the left so that the 10 from 10xxxxxx (last byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000111'11000000) << 2;

            //append to the result, prefixing each byte with a literal "\x"
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x"));
        }
        else if(wstr.at(j) <= 0xFFFF)
        {
            //template for a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            int utf8 = 0b11100000'10000000'10000000;

            //the lowest 6 bits go into the last byte
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (last byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the last 4 remaining bits
             * put them 4 to the left so that the 10 from 10xxxxxx (second byte) is not overwritten either
             */
            utf8 += (wstr.at(j) & 0b11110000'00000000) << 4;

            //append to the result, prefixing each byte with a literal "\x"
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x"));
        }
        else if(wstr.at(j) <= 0x10FFFF)
        {
            //template for a 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            //unsigned, because the template does not fit into a signed 32-bit int
            unsigned int utf8 = 0b11110000'10000000'10000000'10000000;

            //the lowest 6 bits go into the last byte
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (last byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the next 6 bits
             * put them 4 to the left so that the 10 from 10xxxxxx (third byte) is not overwritten either
             */
            utf8 += (wstr.at(j) & 0b00000011'11110000'00000000) << 4;

            /*
             * get the last 3 remaining bits
             * put them 6 to the left so that the 10 from 10xxxxxx (second byte) is not overwritten either
             */
            utf8 += (wstr.at(j) & 0b00011100'00000000'00000000) << 6;

            //append to the result, prefixing each byte with a literal "\x"
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x").insert(12, "\\x"));
        }
    }
    return utf8_encoded;
}
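
For comparison, the same bit layout can be written per byte instead of per packed integer, which avoids the wide binary masks. This is only a sketch and not the code from this answer: the name `encode_code_point` is hypothetical, it returns the raw UTF-8 bytes rather than a hex string, and it does not reject surrogate code points.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 and returns the raw bytes.
// Returns an empty string for values above 0x10FFFF.
std::string encode_code_point(char32_t cp)
{
    std::string bytes;
    if (cp <= 0x7F)
    {
        bytes += static_cast<char>(cp);                          // 0xxxxxxx
    }
    else if (cp <= 0x7FF)
    {
        bytes += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        bytes += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp <= 0xFFFF)
    {
        bytes += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        bytes += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp <= 0x10FFFF)
    {
        bytes += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        bytes += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return bytes;
}
```

For ӷ (U+04F7), `encode_code_point(0x4F7)` returns the two bytes 0xD3 0xB7, which can then be passed through any hex formatter.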


c++

unicode

utf
