How to parse a std::string containing Unicode literals?

Question

I have a std::string that stores UTF-encoded characters. Example:

std::string a = "\\u00c1\\u00c4\\u00d3";

Note that the length of a is 18 (3 characters, with 6 ASCII symbols per UTF character).

Question: how can I convert a into a C++ string that holds just 3 characters? Is there any standard function (library) that does this?

Answer 1

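Nothing in the standard C++ library handles this conversion for you automatically. You will have to parse the string yourself, manually converting each 6-char "\uXXXX" substring into a single wchar value 0xXXXX, which you can then store in a std::wstring or std::u16string as needed.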

For example:

std::string a = "\\u00c1\\u00c4\\u00d3";

std::wstring ws;
ws.reserve(a.size());

for(size_t i = 0; i < a.size();)
{
    char ch = a[i++];

    if ((ch == '\\') && (i < a.size()) && (a[i] == 'u'))
    {
        wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(++i, 4), nullptr, 16));
        i += 4;
        ws.push_back(wc);
    }
    else
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.push_back(static_cast<wchar_t>(ch));
    }
}

Alternatively:

std::string a = "\\u00c1\\u00c4\\u00d3";
 
std::wstring ws;
ws.reserve(a.size());
 
size_t start = 0;
do
{
    size_t found = a.find("\\u", start);
    if (found == std::string::npos) break;

    if (start < found)
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.insert(ws.end(), a.begin()+start, a.begin()+found);
    }
 
    wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(found+2, 4), nullptr, 16));
    ws.push_back(wc);
 
    start = found + 6;
}
while (true);
 
if (start < a.size())
{
    // depending on the charset used for encoding the string,
    // this may or may not need to be decoded further...
    ws.insert(ws.end(), a.begin()+start, a.end());
}
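
Both snippets above assume every escape is well formed; std::stoi throws or silently accepts a shorter number if the four characters after "\u" are not all hex digits. As a rough sketch only, the same idea can be wrapped in a function that validates each escape and copies malformed ones through unchanged. The names decode_unicode_escapes and hex_value are hypothetical helpers made up for this illustration, and the code assumes that everything outside the escapes is plain ASCII.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): returns the value of a
// hexadecimal digit, or -1 if c is not one.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes every well-formed "\uXXXX" sequence in `in`
// into one UTF-16 code unit. Anything else is copied through unchanged,
// assuming the remaining bytes are plain ASCII.
std::u16string decode_unicode_escapes(const std::string& in)
{
    std::u16string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size();)
    {
        // An escape needs 6 chars: backslash, 'u', and four hex digits.
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u')
        {
            unsigned value = 0;
            bool ok = true;
            for (std::size_t k = 2; k < 6; ++k)
            {
                int d = hex_value(in[i + k]);
                if (d < 0) { ok = false; break; }
                value = value * 16 + static_cast<unsigned>(d);
            }
            if (ok)
            {
                out.push_back(static_cast<char16_t>(value));
                i += 6;
                continue;
            }
        }
        // Not a valid escape: keep the character as-is.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(in[i])));
        ++i;
    }
    return out;
}
```

With the string from the question, decode_unicode_escapes(a) yields a std::u16string of 3 code units. Characters outside the Basic Multilingual Plane, if written as two consecutive escapes forming a surrogate pair, come out as two code units, which is how UTF-16 represents them anyway.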

Otherwise, use a third-party library that already performs this kind of translation for you.

c++

unicode

ascii

stl