C++ UTF-8 conversion using atlconv.h / W2A and Chinese texts

Question

I am converting wchar_t* to UTF-8 as follows:

char* DupString(wchar_t* t)
{ 
    if(!t) return strdup("");
    USES_CONVERSION;
    _acp = CP_UTF8;
    return strdup(W2A(t));
}

Normally this works fine, but I have now found a Chinese text, "主体", that cannot be converted correctly.

The macro itself is defined like this:

#define W2A(lpw) (\
    ((_lpw = lpw) == NULL) ? NULL : (\
        (_convert = (lstrlenW(_lpw)+1), \
        (_convert>INT_MAX/2) ? NULL : \
        ATLW2AHELPER((LPSTR) alloca(_convert*sizeof(WCHAR)), _lpw, _convert*sizeof(WCHAR), _acp))))

In my case, _convert = 2 + 1 = 3. When passed to the function call, 3 * sizeof(WCHAR) = 6.

In atlconv.h / AtlW2AHelper, execution reaches WideCharToMultiByte, which returns ret == 0.

_Ret_opt_z_cap_(nChars) inline LPSTR WINAPI AtlW2AHelper(
    _Out_opt_z_cap_(nChars) LPSTR lpa, 
    _In_opt_z_ LPCWSTR lpw, 
    _In_ int nChars, 
    _In_ UINT acp) throw()
{
    ATLASSERT(lpw != NULL);
    ATLASSERT(lpa != NULL);
    if (lpa == NULL || lpw == NULL)
        return NULL;
    // verify that no illegal character present
    // since lpa was allocated based on the size of lpw
    // don't worry about the number of chars
    *lpa = '\0';
    int ret = WideCharToMultiByte(acp, 0, lpw, -1, lpa, nChars, NULL, NULL);
    if(ret == 0)
    {
        ATLASSERT(FALSE);
        return NULL;
    }
    return lpa;
}

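In the Watch window, @err shows error code 122 = ERROR_INSUFFICIENT_BUFFER.

I tried enlarging the buffer by one byte (nChars = 7), and then the conversion succeeded. The buffer was filled with 6 bytes plus 1 terminating ASCII zero, so 7 bytes in total.

Is this a bug in the W2A macro, which does not account for the terminating zero?

Has anyone run into a similar problem?

My platform is Visual Studio 2010; I am not sure whether other Visual Studio versions have the same problem.

Some header files appear to have fixed this problem already, for example here:

https://github.com/kxproject/kx-audio-driver/blob/master/h/gui/kDefs.h

But that is a third-party project, not Visual Studio itself.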

Answer 1

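W2A incorrectly assumes that a buffer of two bytes per character is enough for the conversion. Each of your two characters takes three bytes in UTF-8, so your string expands to a seven-byte UTF-8 string, including the terminating zero. WideCharToMultiByte fails because the buffer is too small, as you have already found.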
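To make the arithmetic concrete, here is a minimal portable sketch (C++11, not part of ATL) that counts how many UTF-8 bytes a UTF-16 string needs. The helper name Utf8Length is hypothetical, and char16_t stands in for wchar_t, which holds UTF-16 code units on Windows.

```cpp
#include <cstddef>

// Hypothetical helper, not an ATL or Win32 function.
// Returns the number of UTF-8 bytes needed to encode a NUL-terminated
// UTF-16 string, excluding the terminating zero.
std::size_t Utf8Length(const char16_t* s)
{
    std::size_t bytes = 0;
    for (; *s; ++s) {
        const char16_t c = *s;
        if (c < 0x80) {
            bytes += 1;                       // ASCII
        } else if (c < 0x800) {
            bytes += 2;                       // U+0080..U+07FF
        } else if (c >= 0xD800 && c <= 0xDBFF &&
                   s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair: one code point
            ++s;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP, including CJK
                                              // (lone surrogates counted as 3: an assumption)
        }
    }
    return bytes;
}
```

For "主体" (U+4E3B, U+4F53) this gives 6 bytes, plus 1 for the terminating zero, while W2A only provides (2 + 1) * sizeof(WCHAR) = 6 bytes.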

This looks like a bug. You can fix the ATL source yourself in atlconv.h (Microsoft will not update VS 2010, and I suspect it may be too late even for a 2015 update):

// Allocate and report 4 bytes per UTF-16 code unit instead of sizeof(WCHAR) == 2.
#define W2A(lpw) (\
    ((_lpw = lpw) == NULL) ? NULL : (\
        (_convert = (static_cast<int>(wcslen(_lpw))+1), \
        (_convert>INT_MAX/4) ? NULL : \
        ATLW2AHELPER((LPSTR) alloca(_convert*4), _lpw, _convert*4, _acp))))

Alternatively, you can use the newer CW2A conversion class, which already allocates a larger buffer (4 bytes per character, see CW2AEX::Init):

static const LPCWSTR g_psz = L"主体";
LPCSTR psz = _strdup(CW2A(g_psz, CP_UTF8));
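If you would rather not depend on any fixed bytes-per-character guess, one option is to call WideCharToMultiByte twice: once to ask for the required size and once to convert. The following is only a sketch of that approach, not the ATL implementation; the function name DupStringUtf8 is hypothetical, and the code is guarded so that it only compiles on Windows.

```cpp
#include <cstdlib>
#include <cstring>

#ifdef _WIN32
#include <windows.h>

// Hypothetical replacement for the question's DupString.
// Returns a malloc'ed UTF-8 copy of t (release it with free), or NULL on failure.
char* DupStringUtf8(const wchar_t* t)
{
    if (!t) return _strdup("");

    // First call: cbMultiByte == 0 returns the required size in bytes,
    // including the terminating zero because cchWideChar is -1.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, t, -1, NULL, 0, NULL, NULL);
    if (bytes == 0) return NULL;

    char* out = static_cast<char*>(std::malloc(bytes));
    if (!out) return NULL;

    // Second call: perform the conversion into the exactly sized buffer.
    if (WideCharToMultiByte(CP_UTF8, 0, t, -1, out, bytes, NULL, NULL) == 0) {
        std::free(out);
        return NULL;
    }
    return out;
}
#endif
```

With this sizing, "主体" gets a 7-byte buffer (6 bytes of UTF-8 plus the terminating zero), so ERROR_INSUFFICIENT_BUFFER should not occur.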

Answer 2

Copied from the Microsoft forums, here:

https://social.msdn.microsoft.com/Forums/en-US/262e7b83-8cf4-45ed-a3db-5dc6064612f2/c-utf8-conversion-using-atlconvh-w2a-and-chinese-texts?forum=vcgeneral&prof=required

Have you considered using the improved ATL7 macro? https://msdn.microsoft.com/en-us/library/87zae4a3.aspx#atl70stringconversionclassesmacros
CW2A pA( pW, CP_UTF8 );
This seems to assume 4 bytes max per Unicode character, rather than 2 that the old one does.

This seems to be a better way to handle the string, because CW2A's destructor frees the conversion buffer.

 char* pStr = NULL;
 {
     CW2A pA( pW, CP_UTF8 );

     pStr = pA;
     // pStr is valid while pA is in scope
 }
 // pStr is invalid here: pA's destructor has released the buffer
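Because the pointer only lives as long as the CW2A object, a common way around the lifetime issue is to copy the bytes into an owning string before the object is destroyed. This is a minimal sketch, assuming MSVC with ATL available; the function name ToUtf8 is hypothetical.

```cpp
#include <string>

#ifdef _MSC_VER
#include <atlbase.h>
#include <atlconv.h>

// Hypothetical helper: returns an owning UTF-8 copy of a wide string.
std::string ToUtf8(const wchar_t* pW)
{
    if (!pW) return std::string();

    ATL::CW2A pA(pW, CP_UTF8);   // conversion buffer is owned by pA
    LPSTR p = pA;                // valid only while pA is alive
    return std::string(p);       // copy made before pA's destructor runs
}
#endif
```

The returned std::string stays valid after the function exits, unlike a raw pointer taken from the CW2A object.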

c++

unicode

atl

visual-studio-2010

visual-studio