Negative lookahead in R to match delimited chunks in a string that do not contain a specific character

Question

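I am trying to extract (from a string) every chunk of characters between two \r\n sequences that does not contain a white space. To do this, I used a negative lookahead.

This is my string: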

my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"

This is what I tried:

pat <- "\r\n+(?! )\r\n.*"

out <- unlist(regmatches(my_string,
                         regexpr(pat, my_string, perl=TRUE)))

This is what I get in R:

> out
 [1] "\r\n\r\nDBhHB\r\n"

As you can see, it stops at the first match.

Edit

In this case, my expected output would be the last part of the string.

> out
 [1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"

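If other chunks in the middle of the string also contained a white space or two, I would like to be able to retrieve more than one section.

my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"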

Suggestions based on base R methods are much appreciated.

Thanks in advance.

Answer 1

I suggest using

(?m)^\S+(?:\R\S+)*$

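See the regex demo. Details:

(?m) - turns on multiline mode
^ - this anchor now matches the start of every line
\S+ - one or more non-whitespace characters
(?:\R\S+)* - zero or more repetitions of a line break sequence followed by one or more non-whitespace characters
$ - end of a line.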
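
As a rough illustration of the breakdown above, the pattern can be assembled piece by piece in base R. This is only a sketch: the names `pat_parts`, `pat_lf`, `lf_text` and `blocks_lf` are made up for illustration, and the sample string deliberately uses plain \n line endings (an assumption that differs from the \r\n data in the question).

```r
# Assemble the pattern from the pieces described above.
pat_parts <- c(
  "(?m)",          # multiline mode: ^ and $ match at line boundaries
  "^",             # start of a line
  "\\S+",          # one or more non-whitespace characters
  "(?:\\R\\S+)*",  # zero or more: a line break, then non-whitespace characters
  "$"              # end of a line
)
pat_lf <- paste(pat_parts, collapse = "")

# Hypothetical sample with LF-only line endings (not the question's data).
lf_text <- "Not This\nKeepThis\nKeepThis\nNot This\nKeepThis\n"

# gregexpr() finds every match; regmatches() extracts the matched text.
blocks_lf <- unlist(regmatches(lf_text, gregexpr(pat_lf, lf_text, perl = TRUE)))
print(blocks_lf)
```

Note that backslashes have to be doubled inside an R string literal, so \S and \R are written as "\\S" and "\\R".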

R demo:

library(stringr)
my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
pat <- "(?m)^\\S+(?:\\R\\S+)*$"
unlist(str_extract_all(my_string, pat))
## => [1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU"

my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"
unlist(str_extract_all(my_string, pat))
## => [1] "KeepThis\r\nKeepThis" "KeepThis"

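Base R usage

Note that base R uses the PCRE engine (with perl=TRUE), where $ in multiline mode (when (?m) is used) matches only before \n. Because your line endings are \r\n, a plain $ cannot mark the end of a line. Adding \r in front of it (\r$) is not a good idea, since you do not want \r in the output. You can use the (*ANYCRLF) PCRE verb instead:

It tells PCRE to treat CRLF, CR, or LF as line-ending sequences.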

unlist(regmatches(my_string, gregexpr("(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$", my_string, perl=TRUE)))

Note that the (*ANYCRLF) PCRE verb must be at the very start of the regex pattern.

See this R demo online.
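
Putting the base R pieces together, a small helper might look like the sketch below. The function name `extract_clean_blocks` and the variables `s1`, `s2`, `res1` and `res2` are hypothetical (not part of any package), and the sketch assumes the input uses \r\n, \r, or \n line endings.

```r
# Hypothetical helper: return every run of consecutive lines that contain
# no whitespace, using only base R and the PCRE engine (perl = TRUE).
extract_clean_blocks <- function(x) {
  # (*ANYCRLF) must come first so that ^ and $ recognise CRLF, CR and LF.
  pat <- "(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$"
  unlist(regmatches(x, gregexpr(pat, x, perl = TRUE)))
}

# The two strings from the question.
s1 <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
s2 <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"

res1 <- extract_clean_blocks(s1)
res2 <- extract_clean_blocks(s2)
print(res1)
print(res2)
```

Because the \r\n at the end of each block is left to the $ anchor instead of being consumed, the returned blocks carry no trailing line break.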
