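Unable to extract Unicode symbols from C++ std::string
I am reading a C++ std::string and passing it to a function that analyzes it and extracts both Unicode symbols and plain ASCII symbols from it.
I went through many tutorials online, and they all mention that standard C++ does not fully support Unicode. Many of them recommend using ICU for C++.
Here is my C++ program, a minimal attempt at the functionality described above.
It takes the raw string, converts it to an ICU UnicodeString, and prints it: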
#include <iostream>
#include <string>
#include "unicode/unistr.h"
int main()
{
    std::string s="Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded
    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));
    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);
    std::wcout << ws << std::endl;
}
Expected output:
Hello☺
Actual output:
Hello?
Please point out what I am doing wrong. Also, please suggest any alternative or simpler approach.
Thanks
Update 1 (Older): The working code is as follows:
#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
void f(const std::string & s)
{
    std::wcout << "Inside called function" << std::endl;
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded
    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));
    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);
    std::wcout << ws << std::endl;
}
int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());
    std::wcout << "Inside main function" << std::endl;
    std::string s=u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded
    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));
    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);
    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}
Now the expected output and the actual output are the same, namely:
Inside main function
hello☺
--------------------------------
Inside called function
hello☺
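Update 2 (Latest): The code in Update 1 does not work for UTF-32 symbols. The working code for all possible Unicode symbols is therefore given below. Special thanks to @Botje for the solution. I wish I could accept his solution more than once! :)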
#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"
void f(const std::u32string & s)
{
    std::wcout << "INSIDE CALLED FUNCTION:" << std::endl;
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;
    for(int i=0; i < ustr.countChar32(); i++)
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;
    std::cout << "--------------------------------" << std::endl;
}
int main()
{
    std::cout << "--------------------------------" << std::endl;
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());
    std::wcout << "INSIDE MAIN FUNCTION:" << std::endl;
    std::u32string s=U"hello☺";
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;
    for(int i=0; i < ustr.countChar32(); i++)
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;
    std::cout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}
Now the expected output and the actual output are the same, namely:
--------------------------------
INSIDE MAIN FUNCTION:
Unicode string is: hello☺
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
--------------------------------
INSIDE CALLED FUNCTION:
Unicode string is: hello☺
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
--------------------------------
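Getting this to work involves several hurdles:
- First, your source file (and the smiley in it) must be encoded as UTF-8. The smiley should consist of the literal bytes 0xE2 0x98 0xBA.
- You should mark the string as containing UTF-8 data with the u8 prefix: u8"Hello☺". Note that from C++20 onward a u8 literal is made of char8_t and no longer initializes a std::string directly.
- Next, the documentation for icu::UnicodeString states that it stores Unicode as UTF-16. You got lucky in this case, because U+263A fits in a single UTF-16 code unit. Other emoji may not! You should either convert to UTF-32 or use the char32At function very carefully, since it takes an offset in UTF-16 code units rather than a character count.
- Finally, the encoding used by wcout should be configured with imbue to match the encoding your environment expects. See the answers to this question.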