Does `std::wregex` support UTF-16/Unicode or only UCS-2?
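With C++11 the regex library was introduced into the standard library.
On the Windows/MSVC platform `wchar_t` has a size of 2 (16 bits), and `wchar_t*` is normally UTF-16 when interfacing with the system/platform (e.g. `CreateFileW`).
However, it seems that `std::regex` isn't UTF-8 or does not support it, so I'm wondering whether `std::wregex` supports UTF-16 or just UCS-2.
I did not find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is:
Does `std::wregex` represent UCS-2 when `wchar_t` has a size of 2?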
The C++ standard doesn't enforce any encoding on `std::string` and `std::wstring`. They are simply sequences of `CharT`. Only `std::u8string`, `std::u16string` and `std::u32string` have a defined encoding.
- What encoding does std::string.c_str() use?
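To make the "just a sequence of `CharT`" point concrete, here is a minimal sketch. The helper names (`count_code_units`, `utf8_e_acute`, `utf16_bold_a`) are made up for illustration; the byte and code-unit values are the usual UTF-8 encoding of U+00E9 and the UTF-16 surrogate pair of U+1D400.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reports how many CharT code units a string holds.
// The standard string classes count code units, never Unicode code points.
template <class CharT>
std::size_t count_code_units(const std::basic_string<CharT>& s) {
    return s.size();
}

// U+00E9 stored as UTF-8 bytes in a plain std::string. The string does not
// know or care that these two bytes form a single character.
inline std::string utf8_e_acute() {
    return std::string("\xC3\xA9");
}

// U+1D400 (MATHEMATICAL BOLD CAPITAL A) stored as a UTF-16 surrogate pair.
inline std::u16string utf16_bold_a() {
    return std::u16string{char16_t(0xD835), char16_t(0xDC00)};
}
```

Both `count_code_units(utf8_e_acute())` and `count_code_units(utf16_bold_a())` yield 2, even though each string holds a single character: the container only ever sees code units.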
Similarly, `std::regex` and `std::wregex` are just `std::basic_regex` over `CharT`, working with `std::basic_string`. Their constructors accept a `std::basic_string`, and whatever encoding is used for that `std::basic_string` is also used for the `std::basic_regex`. So the statement "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8 then `std::regex` and `std::string` will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
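As a rough illustration of the pattern and the subject sharing one encoding, the sketch below searches a UTF-8 string for a UTF-8 literal with plain `std::regex`. The function name `contains_utf8_literal` is hypothetical, and the sketch assumes the literal contains no regex metacharacters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: searches a UTF-8 haystack for a UTF-8 literal.
// std::regex compares char code units, so the pattern only has to be in the
// same encoding as the subject string.
// Assumption: `literal` contains no regex metacharacters.
inline bool contains_utf8_literal(const std::string& haystack,
                                  const std::string& literal) {
    const std::regex re(literal);
    return std::regex_search(haystack, re);
}
```

For example, with the haystack `"un caf\xC3\xA9 noir"` the literal `"caf\xC3\xA9"` is found, while `"caf\xC3\xA8"` (a different accented letter, hence different bytes) is not.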
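On Windows `std::wstring` uses UTF-16, so `std::wregex` also uses UTF-16. UCS-2 is deprecated and practically nobody uses it anymore. You don't even need to distinguish between the two, because UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searching in UTF-16 works exactly the same as in UCS-2 because UTF-16 is self-synchronizing, so a proper needle string can never match starting in the middle of a character in the haystack. The same holds for UTF-8. If a tool doesn't understand UTF-16, then it most likely doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 in the middle of a sequence.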
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
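The "back up at most 3 bytes" property can be sketched as follows. The helper names are invented for this example, and the code assumes the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: given any byte offset into well-formed UTF-8, returns
// the offset of the lead byte of the character that contains that byte.
// It never steps back more than 3 bytes.
inline std::size_t utf8_char_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());  // precondition: pos must index an existing byte
    std::size_t steps = 0;
    while (pos > 0 && steps < 3 &&
           is_utf8_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
        ++steps;
    }
    return pos;
}
```

For the 6-byte string `"a\xF0\x9D\x90\x80z"` (an `a`, then U+1D400, then a `z`), every offset from 1 to 4 maps back to offset 1, the lead byte of the 4-byte character.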
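Because a `std::basic_regex` matches one `CharT` code unit at a time, the only things you need to take care of are avoiding truncation in the middle of a character and, if necessary, normalizing the strings before matching. The former problem can be avoided in UCS-2-only regex engines if you never put characters outside the BMP in a character class (as mentioned in the comments). Replace such classes with a group of alternatives instead.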
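Both precautions can be sketched in a few lines. Everything here is illustrative: the helper names are hypothetical, and the regex half uses `std::regex` over UTF-8 bytes rather than a 16-bit `std::wregex`, because `wchar_t` is not 16 bits on every platform. The idea carries over to any engine that matches one code unit at a time.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// True if `u` is a UTF-16 high (leading) surrogate, 0xD800..0xDBFF.
inline bool is_high_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// Hypothetical helper: keeps at most `max_units` code units of well-formed
// UTF-16 without cutting a surrogate pair in half.
inline std::u16string truncate_utf16(const std::u16string& s,
                                     std::size_t max_units) {
    if (s.size() <= max_units) return s;
    std::size_t n = max_units;
    // If the last kept unit is a high surrogate, its partner would be lost,
    // so drop the high surrogate as well.
    if (n > 0 && is_high_surrogate(s[n - 1])) --n;
    return s.substr(0, n);
}

// Matches exactly one of U+1D400 or U+1D401, written as UTF-8 byte sequences.
// A character class containing these bytes would match a single byte; the
// non-capturing group of alternatives keeps each multi-unit character whole.
inline bool is_bold_a_or_b(const std::string& utf8) {
    static const std::regex re("(?:\xF0\x9D\x90\x80|\xF0\x9D\x90\x81)");
    return std::regex_match(utf8, re);
}
```

The same rewrite applies to a 16-bit engine: list each supplementary character as its full surrogate pair inside a `(?:...|...)` group instead of placing it in `[...]`.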
In other languages normalization takes place.
This is wrong. Some languages may normalize before matching a regex, but that definitely does not apply to all "other languages".
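If you want a bit more assurance, use `std::basic_regex<char8_t>` and `std::basic_regex<char16_t>` for UTF-8 and UTF-16 respectively. You still need a UTF-16-aware library though, otherwise it will still only work for regex strings that contain nothing but literal words. Note also that the standard only provides `std::regex_traits` specializations for `char` and `wchar_t`, so these instantiations may not work out of the box on a given implementation.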
A better solution may be to switch to another library altogether, such as ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support.
Related:
- Do C++11 regular expressions work with UTF-8 strings?
- How well is Unicode supported in C++11?