How to parse std::string containing unicode literals?
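I have a std::string that stores characters written as UTF escape sequences. For example:
std::string a = "\\u00c1\\u00c4\\u00d3";
Note that the length of a is 18 (3 characters, each written as 6 ASCII symbols).
Question: how can I convert a into a C++ string that contains only 3 characters? Is there a standard function (or library) that can do this?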
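Nothing in the standard C++ library handles this conversion for you automatically. You will have to parse the string yourself, manually converting each 6-char "\uXXXX" substring into a single wchar value 0xXXXX, which you can then store in a std::wstring or std::u16string as needed.
For example: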
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
for (size_t i = 0; i < a.size();)
{
    char ch = a[i++];
    if ((ch == '\\') && (i < a.size()) && (a[i] == 'u'))
    {
        // convert the four hex digits that follow "\u" into one wide character
        wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(++i, 4), nullptr, 16));
        i += 4;
        ws.push_back(wc);
    }
    else
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.push_back(static_cast<wchar_t>(ch));
    }
}
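One caveat with the loop above is that std::stoi is lenient: it skips leading whitespace, accepts a sign, stops at the first character that is not a hex digit, and throws std::invalid_argument when nothing can be converted. A stricter check might look like the sketch below. The name parse_hex4 is hypothetical, not a standard function, and the sketch assumes an ASCII-compatible execution character set.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library).
// Reads exactly four hexadecimal digits starting at s[pos].
// On success stores their value in `out` and returns true.
// Returns false if fewer than four characters remain or if
// any of the four characters is not a hex digit.
bool parse_hex4(const std::string& s, std::size_t pos, unsigned& out)
{
    if (pos > s.size() || s.size() - pos < 4)
        return false;

    unsigned value = 0;
    for (std::size_t k = 0; k < 4; ++k)
    {
        const char c = s[pos + k];
        unsigned digit;
        // Assumes 'a'..'f' and 'A'..'F' are contiguous, as in ASCII.
        if (c >= '0' && c <= '9')
            digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = static_cast<unsigned>(c - 'A') + 10;
        else
            return false;
        value = value * 16 + digit;
    }
    out = value;
    return true;
}
```

In the loop, the std::stoi call could then be replaced by a call such as parse_hex4(a, i + 1, value), falling back to copying the backslash unchanged when it returns false. How malformed escapes should be treated is a design choice the original answer leaves open.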
Or:
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
size_t start = 0;
do
{
    // look for the next literal backslash followed by 'u'
    size_t found = a.find("\\u", start);
    if (found == std::string::npos) break;
    if (start < found)
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.insert(ws.end(), a.begin()+start, a.begin()+found);
    }
    wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(found+2, 4), nullptr, 16));
    ws.push_back(wc);
    start = found + 6;
}
while (true);
if (start < a.size())
{
    // depending on the charset used for encoding the string,
    // this may or may not need to be decoded further...
    ws.insert(ws.end(), a.begin()+start, a.end());
}
Otherwise, use a 3rd party library that already does this kind of translation for you.
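If the result has to stay in a narrow std::string rather than a std::wstring, one option is to re-encode each decoded value as UTF-8. The following is only a sketch under stated assumptions: append_utf8 and decode_to_utf8 are hypothetical names, only code points up to U+FFFF are handled (the range a single \uXXXX escape can express), and surrogate pairs are not combined.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of a code point
// in the range U+0000..U+FFFF to `out`. Surrogate values are not
// treated specially here.
void append_utf8(std::string& out, unsigned cp)
{
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical function: replaces every complete "\uXXXX" escape in
// `in` with its UTF-8 encoding and copies all other bytes unchanged.
std::string decode_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();)
    {
        if (in[i] == '\\' && i + 6 <= in.size() && in[i + 1] == 'u')
        {
            // std::stoul throws std::invalid_argument if no hex digits
            // can be converted, like std::stoi in the code above.
            const unsigned cp = static_cast<unsigned>(
                std::stoul(in.substr(i + 2, 4), nullptr, 16));
            append_utf8(out, cp);
            i += 6;
        }
        else
        {
            out.push_back(in[i++]);
        }
    }
    return out;
}
```

With the string from the question, decode_to_utf8(a) yields a std::string holding 3 code points in 6 bytes, because each of U+00C1, U+00C4 and U+00D3 takes two bytes in UTF-8. A std::string counts bytes, not characters, so size() reports 6 rather than 3.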