Find a substring pattern followed by one substring but not by another substring

Question

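I have hospital data (OPCS, NHS) containing procedure codes that are followed by codes indicating laterality.

Using regex in R, I want to identify, within a string, a procedure code that is followed by other procedure codes and then by a laterality code.

However, the match must not include a procedure code of interest that is followed by a different laterality code. Example: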

string <- ("W100 Z923 W200 A456 W200 B234 A234 Z921")

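What I want to match: "W100|W200"

What it must be followed by: "Z921", e.g. this should match: W200 B234 A234 Z921

What it must not be followed by: "Z922|Z923", e.g. this should not match: W100 Z923 W200 A456 W200 B234 A234 Z921

What I have tried: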

# match the procedure followed by Z921:
(W100|W200).{1,}?Z921

# I do not know how to add a negative lookbehind to exclude matches without breaking this. I tried the following, but it fails:
((W100|W200).{1,}Z921) (?<!Z923|Z922)

Edit: improved the clarity of the question and the example.

Answer 1

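You can use

library(stringr)
x <- "W100 Z923 W200 A456 W200 B234 A234 Z921"
str_extract_all(x, "\\bW[12]00\\b(?!\\s+Z92[23]\\b).*?Z921")

Note that backslashes are doubled inside the R string literal. See the regex demo. Details:

\b - a word boundary
W[12]00 - W100 or W200
\b - a word boundary
(?!\s+Z92[23]\b) - a negative lookahead that fails the match if, immediately to the right, there are one or more whitespace characters followed by Z922 or Z923 as a whole word
.*? - any zero or more characters other than line break characters, as few as possible
Z921 - the literal string Z921.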
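To see how the lookahead behaves, here is a minimal sketch in base R (using regmatches() and gregexpr() with perl = TRUE in place of stringr, so it runs without extra packages). It targets Z921, the laterality code present in the question's sample string. The second string `y` and the stricter pattern are assumptions added for illustration: they cover the case where Z922 or Z923 should be rejected anywhere between the procedure code and Z921, not only directly after the procedure code.

```r
# Sample string from the question
x <- "W100 Z923 W200 A456 W200 B234 A234 Z921"

# Lookahead pattern: rejects W100/W200 only when Z922/Z923 is the very next code
pattern <- "\\bW[12]00\\b(?!\\s+Z92[23]\\b).*?Z921"

# perl = TRUE enables lookarounds in base R
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matches)

# Hypothetical string where Z923 is not directly after W100
y <- "W100 A111 Z923 W200 B234 Z921"

# The lookahead pattern still starts at W100 and runs across Z923
loose <- regmatches(y, gregexpr(pattern, y, perl = TRUE))[[1]]
print(loose)

# Stricter variant (an assumption about the intent): each character consumed
# before Z921 must not be the start of a whole-word Z922 or Z923
strict_pattern <- "\\bW[12]00\\b(?:(?!\\bZ92[23]\\b).)*?\\bZ921\\b"
strict <- regmatches(y, gregexpr(strict_pattern, y, perl = TRUE))[[1]]
print(strict)
```

On `x`, both patterns skip W100 (it is directly followed by Z923) and start the match at the first W200. On `y`, only the stricter variant skips W100, because the original lookahead inspects just the code that immediately follows.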

regex

r

regex-lookarounds