Regular expression (regex lookarounds) to detect a certain string not between certain strings (lookahead & lookbehind, word not surrounded by words)

Question

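I am trying to detect all occurrences of a certain string that is not surrounded by certain other strings (using regex lookarounds). For example, all occurrences of "African", but not those inside "South African Society". See the simplified example below.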

#My example text:
text <- c("South African Society", "South African", 
"African Society", "South African Society and African Society")

#My code examples:
library(stringr)
str_detect(text, "(?<!South )African(?! Society)")
#or
grepl("(?<!South )African(?! Society)", text, perl = TRUE)

#I need:
[1] FALSE TRUE TRUE TRUE 

#instead of:
[1] FALSE FALSE FALSE FALSE

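The problem seems to be that the regex evaluates the lookbehind and the lookahead separately rather than as a whole. The match should be rejected only when both conditions hold, not when just one of them does.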

Answer 1

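The (?<!South )African(?! Society) pattern matches African only when it is neither immediately preceded by "South " nor immediately followed by " Society". If either one is present, there is no match, which is why "South African" and "African Society" are rejected as well.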
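
To see the two conditions acting independently, here is a minimal sketch in base R, assuming the text vector from the question. The variable names (not_after_south, not_before_society, both) are mine, for illustration only.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Lookbehind alone: African not immediately preceded by "South "
not_after_south <- grepl("(?<!South )African", text, perl = TRUE)
# FALSE FALSE TRUE TRUE

# Lookahead alone: African not immediately followed by " Society"
not_before_society <- grepl("African(?! Society)", text, perl = TRUE)
# FALSE TRUE FALSE FALSE

# Both lookarounds on the same African: each one can veto the match on its own
both <- grepl("(?<!South )African(?! Society)", text, perl = TRUE)
# FALSE FALSE FALSE FALSE
```

No element passes both checks on the same occurrence of African, so the combined pattern finds nothing.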

There are several solutions.

 African(?<!South African(?= Society))

See the regex demo. Here, African is matched only if, right after matching it, the regex engine does not find South African ending at that position and immediately followed by a space and Society. Running this check after African is more efficient than placing it before the word (see the (?<!South (?=African Society))African regex demo), especially on longer strings that do not match the pattern, because the lookbehind is only tried once African has already been found.
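
A quick check of this pattern in base R, as a sketch that assumes the text vector from the question and uses grepl with perl = TRUE (the variable names are mine):

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Reject African only when it closes "South African" AND " Society" follows
pattern <- "African(?<!South African(?= Society))"

result <- grepl(pattern, text, perl = TRUE)
# FALSE TRUE TRUE TRUE

# Number of accepted occurrences per element
counts <- lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
# 0 1 1 1
```

In the last element only the second African is counted, since the first one sits inside South African Society.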

Alternatively, you can use the SKIP-FAIL technique:

South African Society(*SKIP)(*F)|African

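See another regex demo. Here, South African Society is matched first, then (*SKIP)(*F) makes that match fail and the engine resumes searching after the skipped text, so African is matched in every context except inside South African Society.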
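
A minimal sketch of the SKIP-FAIL variant in base R, again assuming the text vector from the question. (*SKIP) and (*F) are PCRE verbs, so this relies on perl = TRUE. As far as I know the ICU engine behind stringr does not support them, so treat that part as an assumption to verify.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# First alternative consumes the unwanted phrase and discards it;
# the second alternative matches any remaining African
pattern <- "South African Society(*SKIP)(*F)|African"

result <- grepl(pattern, text, perl = TRUE)
# FALSE TRUE TRUE TRUE
```

This form is easy to extend: further phrases to exclude can be added as alternatives before (*SKIP)(*F).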

regex

r

lookahead

lookbehind

regex-lookarounds