German Umlauts and Regular Expressions

Question

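I have run into this strange behavior several times now. When I use ifstream to feed the contents of a file to my program and apply regular expressions to the words of the input, the German letters ä, ö and ü cause trouble. If one of these letters occurs at the beginning of a word, the regex fails to recognize the word, but not if the letter occurs inside the word. So these lines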

string word = "über";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war genau über ihm.";

will not work, because the regex fails to find über in the string search. However,

string word = "für";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war für ihn.";

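works, because the ü occurs inside the word. Why is that, and how can I solve it? I have considered replacing every ü with ue, every ä with ae and every ö with oe, and undoing the replacement afterwards, but is there another possibility? I am using Visual Studio 2015.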

Answer 1

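Use regex check {R"((^|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])über($|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e]))", regex_constants::icase}; instead. The pattern is written as a raw string literal so that the \xHH escapes reach the regex engine; in an ordinary string literal the compiler would turn \x00 into an embedded null character, which cuts the pattern short.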
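The replacement pattern hard-codes the word über. As a minimal sketch, the same idea can be wrapped in a helper that builds the pattern for any word. The names make_word_regex and contains_word are hypothetical and not part of the original answer, and the sketch assumes the word contains no regex metacharacters.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): builds a regex that
// matches `word` only when it is surrounded by ASCII punctuation, whitespace,
// control characters, or the start/end of the text.
// Assumption: `word` contains no regex metacharacters.
std::regex make_word_regex(const std::string& word) {
    assert(!word.empty());
    // Raw string literal, so the \xHH escapes are read by the regex engine.
    static const std::string delimiter =
        R"([\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])";
    return std::regex("(^|" + delimiter + ")" + word + "($|" + delimiter + ")",
                      std::regex_constants::icase);
}

// Returns true if `word` occurs in `text` as a delimited word.
bool contains_word(const std::string& text, const std::string& word) {
    return std::regex_search(text, make_word_regex(word));
}
```

With this sketch, contains_word("Es war genau über ihm.", "über") is expected to succeed, while contains_word("Es war darüber.", "über") should not, because the preceding r is not in the delimiter class. Note that a match includes the delimiters on either side, and that case-insensitive matching of the umlauts themselves (Ü versus ü) may still depend on encoding and locale.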

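The default grammar of C++ regular expressions is modeled on JavaScript (ECMAScript), and its \b doesn't support Unicode: it recognizes only the ASCII word characters A-Za-z0-9_. Since ü is not one of them, there is no word boundary between a space and a leading ü, so \büber\b cannot match. In für the ü is enclosed by the ASCII letters f and r, so both boundaries are found.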

And from microsoft.com:

Word Boundary

A word boundary occurs in the following situations:

The current character is at the beginning of the target sequence and is one of the word characters A-Za-z0-9_.

The current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters.

The current character is one of the word characters and the preceding character is not.

The current character is not one of the word characters and the preceding character is.
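The four rules above can be sketched as a small function, which makes it easier to see why a leading ü yields no boundary. This is only an illustration of the quoted rules, not the library's actual implementation; the names is_ascii_word_char and is_word_boundary are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Word characters as listed in the quoted rules: A-Z, a-z, 0-9 and '_'.
bool is_ascii_word_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// Returns true if, per the four quoted rules, a word boundary lies directly
// before index `pos`; pos == text.size() means "past the end".
bool is_word_boundary(const std::string& text, std::size_t pos) {
    assert(pos <= text.size());
    if (text.empty()) return false;
    if (pos == 0) return is_ascii_word_char(text.front());          // rule 1
    if (pos == text.size()) return is_ascii_word_char(text.back()); // rule 2
    // Rules 3 and 4: exactly one of the two neighbours is a word character.
    return is_ascii_word_char(text[pos]) != is_ascii_word_char(text[pos - 1]);
}
```

For "Es war genau über ihm." the position of ü has a space before it and a non-ASCII character at it, so neither neighbour is a word character and no boundary is reported. For "Es war für ihn." the position of f follows a space and is itself a word character, so a boundary is reported.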
