German Umlauts and Regular Expressions

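I have run into this odd behavior several times now. When I feed file contents to my program via ifstream and apply a regular expression to the words read in, the German letters ä, ö and ü cause trouble. If one of these letters is at the beginning of a word, the regex fails to match it, but if the letter occurs inside the word, the match succeeds. So these lines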

string word = "über";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war genau über ihm.";

do not work, because the regex cannot find über in the string search. However,

string word = "für";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war für ihn.";

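does work, because the ü occurs inside the word. Why is that, and how can I fix it? I considered replacing every ü with ue, every ä with ae and every ö with oe and undoing the replacement afterwards, but is there another option? I am using Visual Studio 2015.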

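Use this instead (written as a raw string literal, so that the \x escapes reach the regex engine instead of being processed by the compiler, where \x00 would cut the pattern short):

regex check {R"((^|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])über($|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e]))", regex_constants::icase};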

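The default C++ regex grammar is modeled on JavaScript (ECMAScript), and its \b does not support Unicode: only A-Za-z0-9_ count as word characters. Since ü is not one of them, there is no word boundary between a space and a leading ü, so \büber cannot match there. In für the boundaries fall before f and after r, which are both word characters, so that pattern matches.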

And from microsoft.com:

Word Boundary

A word boundary occurs in the following situations:

  • The current character is at the beginning of the target sequence and is one of the word characters A-Za-z0-9_.

  • The current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters.

  • The current character is one of the word characters and the preceding character is not.

  • The current character is not one of the word characters and the preceding character is.
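To make the difference concrete, here is a minimal sketch that puts the \b approach from the question next to an explicit-delimiter pattern in the spirit of the answer. The helper names contains_word and contains_word_b are hypothetical, not part of any library. As a simplification, the delimiter class is written as [[:space:][:punct:]] rather than the hex ranges above, so unlike the answer's pattern it also treats _ as a delimiter and does not cover control characters. It assumes word contains no regex metacharacters and that the default "C" locale is in effect.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): the \b approach
// from the question, kept for comparison.
bool contains_word_b(const std::string& text, const std::string& word) {
    const std::regex re(R"(\b)" + word + R"(\b)",
                        std::regex_constants::icase);
    return std::regex_search(text, re);
}

// Hypothetical helper: instead of \b, require that the word is preceded by
// the start of the text or a delimiter, and followed by the end of the text
// or a delimiter. Delimiters here are whitespace and ASCII punctuation.
// Assumption: `word` contains no regex metacharacters.
bool contains_word(const std::string& text, const std::string& word) {
    const std::string delim = "[[:space:][:punct:]]";
    const std::regex re("(^|" + delim + ")" + word + "($|" + delim + ")",
                        std::regex_constants::icase);
    return std::regex_search(text, re);
}
```

With these helpers, contains_word("Es war genau über ihm.", "über") should find the word, while contains_word_b with the same arguments should not. Note that a match of the delimiter version includes the surrounding delimiter characters, so this shape suits a yes/no test better than extracting the exact match position.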