Compare unicode std::string with usual "" literal or u8"" declaration

Question

On Windows with Visual Studio 2015:

    // Ü
    //    UTF-8  (hex) 0xC3 0x9C 
    //    UTF-16 (hex) 0x00DC 
    //    UTF-32 (hex) 0x000000DC 

    using namespace std::string_literals;
    const auto narrow_multibyte_string_s = "\u00dc"s;
    const auto wide_string_s             = L"\u00dc"s;
    const auto utf8_encoded_string_s     = u8"\u00dc"s;
    const auto utf16_encoded_string_s    = u"\u00dc"s;
    const auto utf32_encoded_string_s    = U"\u00dc"s;

    assert(utf8_encoded_string_s     == "\xC3\x9C");
    assert(narrow_multibyte_string_s ==        "Ü");
    assert(utf8_encoded_string_s     ==      u8"Ü");

    // here is the question
    assert(utf8_encoded_string_s != narrow_multibyte_string_s);

In other words, "\u00dc"s is not the same as u8"\u00dc"s, and "Ü"s is not the same as u8"Ü"s.

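Apparently the default encoding of an ordinary string literal is not UTF-8 (perhaps UTF-16?), so I cannot compare two std::strings without knowing their encoding, even when they have the same meaning.

What is the usual practice for this kind of string comparison when developing a Unicode-enabled C++ application?

For example, an API like this: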

    class MyDatabase
    {
    public:
        bool isAvailable(const std::string& key)
        {
            // *compare* key in database
            return key == "Ü";
        }
    };

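Other programs may call isAvailable with a std::string in UTF-8 or in the default (UTF-16?) encoding. How can I guarantee that the comparison is done correctly?

Can I detect an encoding mismatch at compile time?

Note: I prefer C++11/14 features, and I prefer std::string over std::wstring.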

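Answer 1

"\u00dc" is a char[] encoded in whatever the compiler/OS's default 8-bit encoding happens to be, so it may differ between machines. On Windows, that tends to be the OS's default ANSI encoding, or it may be the encoding the source file was saved in.

L"\u00dc" is a wchar_t[] encoded in either UTF-16 or UTF-32, depending on the compiler's definition of wchar_t (which is 16 bits on Windows, so UTF-16).

u8"\u00dc" is a char[] encoded in UTF-8.

u"\u00dc" is a char16_t[] encoded in UTF-16.

U"\u00dc" is a char32_t[] encoded in UTF-32.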
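The prefixes listed above, combined with the s suffix used in the question, can be checked by the compiler itself. This is a minimal sketch assuming a C++14 compiler. The `#if` branch only accounts for C++20 and later, where u8 literals use char8_t instead of char. The function name `check_code_unit_counts` is made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

using namespace std::string_literals;

// The prefix picks the character type; the s suffix returns the matching
// std::basic_string specialization.
static_assert(std::is_same<decltype("\u00dc"s),  std::string>::value,    "narrow");
static_assert(std::is_same<decltype(L"\u00dc"s), std::wstring>::value,   "wide");
static_assert(std::is_same<decltype(u"\u00dc"s), std::u16string>::value, "UTF-16");
static_assert(std::is_same<decltype(U"\u00dc"s), std::u32string>::value, "UTF-32");
#if defined(__cpp_char8_t)
static_assert(std::is_same<decltype(u8"\u00dc"s), std::u8string>::value, "UTF-8 (C++20)");
#else
static_assert(std::is_same<decltype(u8"\u00dc"s), std::string>::value,   "UTF-8");
#endif

// Only the u8, u and U encodings are fixed, so only their lengths are portable.
void check_code_unit_counts()
{
    assert((u8"\u00dc"s).size() == 2); // 0xC3 0x9C
    assert((u"\u00dc"s).size()  == 1); // 0x00DC
    assert((U"\u00dc"s).size()  == 1); // 0x000000DC
}
```

The narrow and wide literals are deliberately left out of the length checks, because their encodings depend on the compiler and platform.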

""s 后缀只需 returns 一个 std::string、std::wstring、std::u16string 或 std::u32string，具体取决于 char[]、wchar_t[]、char16_t[] 或 char32_t[] 传递给它。

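When comparing two strings, first make sure they use the same encoding. This matters especially for your char[]/std::string data, since it could be in any number of 8-bit encodings depending on the systems involved. It is not much of a problem if the application generates the strings itself, but it becomes important when one or more of the strings comes from an external source (a file, user input, a network protocol, and so on).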
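One way to make "same encoding first" harder to get wrong is to carry the encoding in the type. The sketch below is only an illustration under stated assumptions, not a standard facility: `Utf8String` and `from_utf8` are hypothetical names, and the wrapper merely labels bytes as UTF-8 without validating them. Because a plain std::string does not convert to it implicitly, passing one to isAvailable fails to compile, which gives a rough form of the compile-time check asked about in the question.

```cpp
#include <string>
#include <utility>

// Hypothetical wrapper: marks a byte sequence as UTF-8. It does not validate.
class Utf8String
{
public:
    // The caller asserts that 'bytes' is already UTF-8.
    static Utf8String from_utf8(std::string bytes)
    {
        return Utf8String(std::move(bytes));
    }

    const std::string& bytes() const { return bytes_; }

    friend bool operator==(const Utf8String& a, const Utf8String& b)
    {
        return a.bytes_ == b.bytes_;
    }

private:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}
    std::string bytes_;
};

class MyDatabase
{
public:
    bool isAvailable(const Utf8String& key) const
    {
        // "\xC3\x9C" is U+00DC spelled out as UTF-8 bytes.
        return key == Utf8String::from_utf8("\xC3\x9C");
    }
};
```

A caller then has to write something like db.isAvailable(Utf8String::from_utf8(text)), which at least forces a decision about the encoding at the call site.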

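In your example, "\u00dc" and "Ü" are not guaranteed to produce the same char[] sequence; that depends on how the compiler interprets the two literals. But even if they do (which appears to be the case in your example), neither is likely to produce UTF-8 (you have to take extra measures to force that), which is why your comparison with utf8_encoded_string_s fails.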
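To see which bytes the compiler actually stored for each literal, it can help to dump the strings as hex. The helper below is a small sketch; `to_hex` is a hypothetical name, not a library function.

```cpp
#include <cstddef>
#include <string>

// Renders each byte of 's' as two uppercase hex digits, separated by spaces.
std::string to_hex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[byte >> 4];   // high nibble
        out += digits[byte & 0x0F]; // low nibble
    }
    return out;
}
```

On a setup like the one in the question, to_hex(utf8_encoded_string_s) gives C3 9C, while to_hex(narrow_multibyte_string_s) would likely show something else, such as the single byte DC if the ANSI code page is Windows-1252.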

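So, if you want your string literals to be UTF-8, use u8"" to ensure that. If you get string data from an external source and need it to be UTF-8, convert it to UTF-8 as early as possible in your code if it is not UTF-8 already (which means you have to know the encoding the external source uses).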
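As an illustration of converting at the boundary, here is a minimal sketch that assumes the external data is known to be ISO-8859-1 (Latin-1), where every byte maps directly to the code point of the same value. The function name `latin1_to_utf8` is hypothetical. Other 8-bit encodings, including the Windows ANSI code pages, need a real conversion table or a platform API instead.

```cpp
#include <string>

// Assumes 'input' is ISO-8859-1: each byte is the code point U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    for (const char ch : input) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            out += static_cast<char>(byte);                 // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (byte >> 6));   // lead byte: 110000xx
            out += static_cast<char>(0x80 | (byte & 0x3F)); // continuation: 10xxxxxx
        }
    }
    return out;
}
```

With this, the Latin-1 byte 0xDC for Ü becomes the two bytes 0xC3 0x9C, so the result can be compared against a u8"" literal.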
