How to get a C++ std::string from Little-Endian UTF-16 encoded bytes
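I have a third-party device that talks to my Linux box over a poorly documented proprietary communication protocol. Some packets carry "strings" which, after reading this Joel On Software article, appear to be UTF-16 Little-Endian encoded. In other words, after receiving such a packet, what I end up with on my Linux box is something like: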
// The string "Out"
unsigned char data1[] = {0x4f, 0x00, 0x75, 0x00, 0x74, 0x00, 0x00, 0x00};
// The string "°F"
unsigned char data2[] = {0xb0, 0x00, 0x46, 0x00, 0x00, 0x00};
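As I understand it, I cannot treat these as std::wstring, because on Linux wchar_t is 4 bytes. I do have one thing going for me, though: my Linux box is also Little-Endian. So I believe I need to use something like std::codecvt_utf8_utf16<char16_t>. However, even after reading the documentation, I still cannot figure out how to actually get from unsigned char[] to std::string. Can anyone help?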
If you want to use std::codecvt (deprecated since C++17), you can wrap the UTF-16 text and then convert it to UTF-8 as needed.
i.e.
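#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>
// Simply cast the raw data for the constructor, since we know that the char
// array is actually a 'byte' array from the network API. This relies on the
// data ending in a 0x00 0x00 terminator and on a little-endian host.
std::u16string u16_str( reinterpret_cast<const char16_t*>(data2) );
// UTF-16/char16_t to UTF-8
std::string u8_conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t>{}.to_bytes(u16_str);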
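The cast above depends on the buffer being suitably aligned for char16_t, on a terminating 0x0000 code unit, and on a little-endian host. As a minimal sketch that sidesteps those assumptions, the code units can be assembled byte by byte before conversion. The helper name utf16le_to_utf8 and its explicit length parameter are illustrative and not part of the answer above; std::wstring_convert is still deprecated since C++17, so compiler warnings are to be expected.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: converts `len` bytes of UTF-16LE data to UTF-8.
// Stops at the first 0x0000 code unit, if any, and ignores a trailing odd byte.
std::string utf16le_to_utf8(const unsigned char* bytes, std::size_t len)
{
    std::u16string u16;
    u16.reserve(len / 2);
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        // UTF-16LE stores the low byte first, so the second byte is shifted up.
        const char16_t unit =
            static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        if (unit == u'\0') {
            break;  // assumption: the protocol uses 0x0000 as a terminator
        }
        u16.push_back(unit);
    }
    // Deprecated since C++17, as noted above.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(u16);  // throws std::range_error on invalid UTF-16
}
```

With the arrays from the question, utf16le_to_utf8(data2, sizeof data2) should yield the three bytes C2 B0 46, which is "°F" in UTF-8.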
For completeness, here is the simplest iconv-based conversion I came up with:
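#include <iconv.h>
#include <cerrno>    // errno
#include <cstring>   // std::strerror
#include <iostream>
#include <string>
auto iconv_eng = ::iconv_open("UTF-8", "UTF-16LE");
if (reinterpret_cast<::iconv_t>(-1) == iconv_eng)
{
    std::cerr << "Unable to create ICONV engine: " << std::strerror(errno) << std::endl;
}
else
{
    // src       a char * to utf16 bytes
    // src_size  the maximum number of bytes to convert
    // dest      a char * to utf8 bytes to generate
    // dest_size the maximum number of bytes to write
    // iconv() advances src and dest, so remember where the output starts
    char *dest_begin = dest;
    if (static_cast<std::size_t>(-1) == ::iconv(iconv_eng, &src, &src_size, &dest, &dest_size))
    {
        std::cerr << "Unable to convert from UTF16: " << std::strerror(errno) << std::endl;
    }
    else
    {
        // build the result from the bytes written to the output buffer, not from src
        std::string utf8_str(dest_begin, dest);
    }
    // close the descriptor on both the success and the failure path
    ::iconv_close(iconv_eng);
}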
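The snippet above leaves src, src_size, dest, and dest_size to the caller. As a rough sketch of how they might be set up, the following wraps the same iconv calls in a function. The name utf16le_to_utf8_iconv is hypothetical, the char ** input parameter matches the glibc declaration of iconv (other platforms may declare it differently), and the output buffer is sized on the assumption that each 2-byte UTF-16 code unit produces at most 3 bytes of UTF-8.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `len` bytes of UTF-16LE data to UTF-8 via iconv.
// The input is taken by length, not NUL-terminated; pass the payload length
// without the trailing 0x00 0x00 to keep a '\0' out of the result.
std::string utf16le_to_utf8_iconv(const unsigned char* bytes, std::size_t len)
{
    iconv_t cd = ::iconv_open("UTF-8", "UTF-16LE");
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error(std::string("iconv_open: ") + std::strerror(errno));
    }

    // glibc's iconv takes a non-const char **, although it does not modify the input.
    char* in = const_cast<char*>(reinterpret_cast<const char*>(bytes));
    std::size_t in_left = len;

    // 2 * len is a generous bound: 2 input bytes yield at most 3 output bytes.
    std::string out(2 * len, '\0');
    char* out_ptr = &out[0];
    std::size_t out_left = out.size();

    const std::size_t rc = ::iconv(cd, &in, &in_left, &out_ptr, &out_left);
    const int saved_errno = errno;  // iconv_close may overwrite errno
    ::iconv_close(cd);              // released on success and on failure
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error(std::string("iconv: ") + std::strerror(saved_errno));
    }

    out.resize(out.size() - out_left);  // keep only the bytes actually written
    return out;
}
```

Throwing on failure instead of logging to std::cerr is a design choice of this sketch; the error-reporting style of the original snippet would work equally well.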