UTF-8 support in Visual Studio 2017 std::experimental::filesystem::path

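I was happy to see that Visual Studio 2017 added support for std::experimental::filesystem, but I have now run into a Unicode problem. I somewhat blindly assumed that I could use UTF-8 strings everywhere, but that failed: when constructing a std::experimental::filesystem::path from a char* pointing to a UTF-8 encoded string, no conversion takes place (even though the headers use the _To_wide and _To_byte functions internally). I wrote a simple test example: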

#include <cstdio>   // printf
#include <cstring>  // strlen, wcslen
#include <string>
#include <experimental/filesystem>

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

static inline std::string FromUtf16(const wchar_t* pUtf16String)
{
    int nUtf16StringLength = static_cast<int>(wcslen(pUtf16String));
    int nUtf8StringLength = ::WideCharToMultiByte(CP_UTF8, 0, pUtf16String, nUtf16StringLength, NULL, 0, NULL, NULL);
    std::string sUtf8String(nUtf8StringLength, '\0');
    nUtf8StringLength = ::WideCharToMultiByte(CP_UTF8, 0, pUtf16String, nUtf16StringLength, const_cast<char *>(sUtf8String.c_str()), nUtf8StringLength, NULL, NULL);
    return sUtf8String;
}

static inline std::string FromUtf16(const std::wstring& sUtf16String)
{
    return FromUtf16(sUtf16String.c_str());
}

static inline std::wstring ToUtf16(const char* pUtf8String)
{
    int nUtf8StringLength = static_cast<int>(strlen(pUtf8String));
    int nUtf16StringLength = ::MultiByteToWideChar(CP_UTF8, 0, pUtf8String, nUtf8StringLength, NULL, 0);
    std::wstring sUtf16String(nUtf16StringLength, L'\0');
    nUtf16StringLength = ::MultiByteToWideChar(CP_UTF8, 0, pUtf8String, nUtf8StringLength, const_cast<wchar_t*>(sUtf16String.c_str()), nUtf16StringLength);
    return sUtf16String;
}

static inline std::wstring ToUtf16(const std::string& sUtf8String)
{
    return ToUtf16(sUtf8String.c_str());
}

int main(int argc, char** argv)
{
    std::string sTest(u8"Kaķis");
    std::wstring sWideTest(ToUtf16(sTest));
    wchar_t pWideTest[1024] = {};
    char pByteTest[1024];
    std::experimental::filesystem::path Path1(sTest), Path2(sWideTest);
    std::experimental::filesystem::v1::_To_wide(sTest.c_str(), pWideTest);
    bool bWideEqual = sWideTest == pWideTest;
    std::experimental::filesystem::v1::_To_byte(pWideTest, pByteTest);
    bool bUtf8Equal = sTest == pByteTest;
    bool bPathsEqual = Path1 == Path2;
    printf("wide equal: %d, utf-8 equal: %d, paths equal: %d\n", bWideEqual, bUtf8Equal, bPathsEqual);
}

But as I said, I had just blindly assumed that UTF-8 would work. The constructor section for std::experimental::filesystem::path on cppreference.com actually states:

  • If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
  • If the source character type is char16_t, conversion from UTF-16 to native filesystem encoding is used.
  • If the source character type is char32_t, conversion from UTF-32 to native filesystem encoding is used.
  • If the source character type is wchar_t, the input is assumed to be the native wide encoding (so no conversion takes place on Windows)

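I'm not sure how to interpret the first line. First, it only says something about POSIX systems (and I don't understand what the native narrow encoding is; does this mean UTF-8 won't work on POSIX either?). Second, it says nothing about Windows, and MSDN is silent on this as well. So how do I properly initialize a std::experimental::filesystem::path from Unicode characters in a cross-platform, safe way?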

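The "narrow" (8-bit) encoding of filesystem::path depends on the environment and the host OS. On many POSIX systems it may be UTF-8, but it may also not be. If you want to use UTF-8, you should do so explicitly, via std::filesystem::path::u8string() and std::filesystem::u8path(). The std::experimental::filesystem namespace provides the same two functions.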
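
As a rough sketch of what "explicitly" could look like, the two helpers below wrap those calls. The names PathFromUtf8 and PathToUtf8 are made up for illustration, and the code assumes a C++17 standard library with <filesystem>; with Visual Studio 2017's experimental header you would swap in <experimental/filesystem> and std::experimental::filesystem. Note that C++20 deprecates u8path in favour of constructing a path from a char8_t string, and changes the return type of u8string() to std::u8string, which is why the second helper copies the bytes rather than returning the result directly.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from bytes that are known to be UTF-8.
// Unlike the plain char* constructor, u8path does not assume the native
// narrow encoding, so the result should not depend on the active code page.
inline fs::path PathFromUtf8(const std::string& sUtf8)
{
    return fs::u8path(sUtf8);
}

// Hypothetical helper: get the path back as UTF-8 bytes.
// u8string() returns std::string in C++17 and std::u8string in C++20,
// so the bytes are copied into a std::string to cover both cases.
inline std::string PathToUtf8(const fs::path& path)
{
    const auto sUtf8 = path.u8string();
    return std::string(sUtf8.begin(), sUtf8.end());
}
```

In the test program from the question, replacing Path1(sTest) with Path1(PathFromUtf8(sTest)) would be the corresponding change; on Windows the path is then expected to hold the same UTF-16 text as Path2, though I have not verified this against the VS2017 experimental implementation specifically.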