I need UTF8 encoded representation of a hex string, not UTF16

Question

I need to get the UTF8 representation of the following hex value, not UTF16. I am using C++Builder 11.

setlocale(LC_ALL, ".UTF8");
String tb64 = UTF8String(U"D985");//Hex value of the letter م or M in arabic

 std::wstring hex;
for(int i =1; i < tb64.Length()+1; ++i)
        hex += tb64[i];

int len = hex.length();
std::wstring newString;
std::wstring byte;
String S;

for(int i=0; i< len; i+=4)
{

 byte = hex.substr(i,4);

 wchar_t  chr =( wchar_t ) ( int) wcstol(byte.c_str(), 0, 16);
     newString.push_back(chr);
     S = newString.c_str();
}

The output should be م (the letter M in Arabic), not garbage.

https://dencode.com/en/string?v=D985&oe=UTF-8&nl=crlf

Answer 1

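You are assigning the hex string to a UTF8String, and then assigning that to a (Unicode)String, which converts the UTF-8 to UTF-16. Then you create a separate std::wstring from the UTF-16 characters. std::wstring uses UTF-16 on Windows and UTF-32 on other platforms.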

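All of those string conversions are unnecessary, since you are dealing with hex characters in the ASCII range. Just iterate the characters of the original hex string as-is; no conversion is needed.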

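In any case, you are trying to decode each 4-digit hex sequence directly into a binary Unicode codepoint number. But in this case, codepoint U+D985 is not a valid Unicode character: it falls in the UTF-16 surrogate range (U+D800 to U+DFFF), which is why you get garbage.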
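
To illustrate why reading "D985" as a codepoint goes wrong, here is a small sketch in standard C++ (not C++Builder-specific). The helper name isSurrogate is made up for this example; it only checks whether a value lies in the surrogate range.

```cpp
// Hypothetical helper (not part of any library): returns true if the value
// lies in the UTF-16 surrogate range U+D800..U+DFFF. Such values are not
// valid Unicode characters on their own.
bool isSurrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

With this, isSurrogate(0xD985) is true, while isSurrogate(0x0645) (the actual codepoint of م) is false.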

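"D985" is actually the hex-encoded UTF-8 bytes (0xD9 0x85) of the Unicode character م (codepoint U+0645), so you need to convert each pair of 2 hex digits into a single byte, and store the bytes as-is into a UTF8String, not a std::wstring.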

The RTL has a StrToInt() function that can decode a hex-encoded UnicodeString into an integer, which you can treat as a byte in this case.

Try something more like this:

String hex = _D("D985");
int len = hex.Length();

UTF8String utf8;
for(int i = 1; i <= len; i += 2) {
    utf8 += static_cast<char>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}

/* alternatively:
UTF8String utf8;
utf8.SetLength(len / 2);

for(int i = 1, j = 1; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<char>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}
*/

// use utf8 as needed...
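
For comparison, the same pair-by-pair decoding can be sketched in standard C++ without the RTL. This is only an illustration under my own assumptions: hexDigitValue and hexToBytes are hypothetical names, and the bytes are stored in a std::string rather than a UTF8String.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert one hex digit ('0'-'9', 'a'-'f', 'A'-'F') to its value 0..15.
static int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Hypothetical helper: decode a hex string such as "D985" into raw bytes.
// Each pair of hex digits becomes exactly one byte; no text conversion occurs.
std::string hexToBytes(const std::string& hex) {
    if (hex.size() % 2 != 0)
        throw std::invalid_argument("hex string must have an even length");

    std::string bytes;
    bytes.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        int value = hexDigitValue(hex[i]) * 16 + hexDigitValue(hex[i + 1]);
        bytes.push_back(static_cast<char>(value));
    }
    return bytes;
}
```

Calling hexToBytes("D985") gives a 2-byte string holding 0xD9 0x85, which is the UTF-8 encoding of م.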

If you need to convert the decoded UTF-8 to UTF-16, just assign the UTF8String as-is to a UnicodeString, eg:

UnicodeString utf16 = utf8;
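
In C++Builder that assignment does the conversion for you. If you are curious what a UTF-8 to UTF-16 conversion involves, here is a rough sketch in standard C++. The function name utf8ToUtf16 is hypothetical, and the validation is deliberately minimal (it does not reject overlong encodings or encoded surrogates), so treat it as an illustration rather than a replacement for the RTL.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Minimal validation only: lead and continuation byte patterns are checked,
// but overlong encodings and encoded surrogates are not rejected.
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            if (cp > 0x10FFFF)
                throw std::invalid_argument("codepoint out of range");
            // Codepoints above the BMP are split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the bytes 0xD9 0x85, this produces the single UTF-16 code unit 0x0645, which is م.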

Alternatively, you can store the decoded bytes into a TBytes and then use the GetString() method of TEncoding::UTF8, eg:

String hex = _D("D985");
int len = hex.Length();

TBytes utf8;
utf8.Length = len / 2;
for(int i = 1, j = 0; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<System::Byte>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}

UnicodeString utf16 = TEncoding::UTF8->GetString(utf8);
// use utf16 as needed...

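I just thought of a slightly simpler solution: the RTL also has a HexToBin() function, which can decode an entire hex-encoded string into a full byte array in a single operation, eg: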

String hex = _D("D985");

UTF8String utf8;
utf8.SetLength(hex.Length() / 2);
HexToBin(hex.c_str(), &utf8[1], utf8.Length());

/* or:
TBytes utf8;
utf8.Length = hex.Length() / 2;
HexToBin(hex.c_str(), &utf8[0], utf8.Length);
*/

// use utf8 as needed...


c++

c++builder