Capture from 1st occurrence of x to nth occurrence of y

Question

I have a regex problem that looks quite doable, but I can't seem to get it right.

Given strings like the following:

id1234|a; b; c; d
id5678|a; b; e; f
id9012|a; g; h; i

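I am trying to capture from the pipe up to a fixed occurrence of the semicolon. Intuitively, I expected something like:

"(?<=\\|)(([^;]*;){1}[^;]*).*"

to select from the pipe to the second semicolon, since:

"^(([^;]*;){1}[^;]*).*"

selects from the start of the line to the second semicolon.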

https://regexr.com/ seems to believe that:

(?<=\|)(([^;]*;){1}[^;]*).*

correctly selects from the pipe onwards, but it fails to end the capture at the correct semicolon.

R's gsub, however, complains:

a <- c("id1234|a; b; c; d",
       "id5678|a; b; e; f",
       "id9012|a; g; h; i",
       "id3456|a; j; k; l")

b <- gsub(pattern = "^(([^;]*;){1}[^;]*).*",
          replacement = "\\1",
          x = a)
b
[1] "id1234|a; b" "id5678|a; b" "id9012|a; g" "id3456|a; j"

c <- gsub(pattern = "(?<=\\|)(([^;]*;){1}[^;]*).*",
          replacement = "\\1",
          x = a)
Error in gsub(pattern = "(?<=\\|)(([^;]*;){1}[^;]*).*", replacement = "\\1",  : 
  invalid regular expression '(?<=\|)(([^;]*;){1}[^;]*).*', reason 'Invalid regexp'
In addition: Warning message:
In gsub(pattern = "(?<=\\|)(([^;]*;){1}[^;]*).*", replacement = "\\1",  :
  TRE pattern compilation error 'Invalid regexp'

Answer 1

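You get the error because you are using a lookbehind, (?<=\|), without the perl=TRUE argument that enables the PCRE regex engine, which supports lookbehinds. The default TRE regex engine does not support lookarounds.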
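As a quick check of that point, the sketch below reruns the question's own pattern with perl = TRUE added (the variable name res is only illustrative). The error goes away, but because the match can only start right after the pipe, the id prefix before it is never part of the match and so stays in the result:

```r
a <- c("id1234|a; b; c; d",
       "id5678|a; b; e; f",
       "id9012|a; g; h; i",
       "id3456|a; j; k; l")

# Same pattern as in the question; perl = TRUE switches to PCRE,
# which understands the (?<=...) lookbehind.
res <- gsub(pattern = "(?<=\\|)(([^;]*;){1}[^;]*).*",
            replacement = "\\1",
            x = a,
            perl = TRUE)
res
## => [1] "id1234|a; b" "id5678|a; b" "id9012|a; g" "id3456|a; j"
```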

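Since sub replaces only the text it matches, the pattern needs to match the whole string so that everything outside the capture group is dropped:

sub("^[^|]*\\|([^;]*;[^;]*).*", "\\1", a)

R demo: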

a <- c("id1234|a; b; c; d",
       "id5678|a; b; e; f",
       "id9012|a; g; h; i",
       "id3456|a; j; k; l")
sub("^[^|]*\\|([^;]*;[^;]*).*", "\\1", a)
## => [1] "a; b" "a; b" "a; g" "a; j"

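Regex details:

^ - start of string
[^|]* - zero or more characters other than |
\| - a | character
([^;]*;[^;]*) - Group 1: zero or more characters other than ;, then a ;, then zero or more characters other than ;
.* - the rest of the string.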
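To go from the second semicolon to the n-th one, as the title asks, the same pattern can be parameterised with a repetition count. The following is only a sketch: upto_nth_semicolon is a hypothetical helper name, not part of base R, and it assumes every input contains a pipe and at least n - 1 semicolons after it (strings that do not match are returned unchanged by sub).

```r
# Hypothetical helper: keep the text between the first | and the n-th ;
# (the n-th semicolon itself is not included).
upto_nth_semicolon <- function(x, n) {
  stopifnot(n >= 1)
  # Group 1 holds (n - 1) "chars-then-semicolon" chunks plus the final chunk.
  pattern <- sprintf("^[^|]*\\|(([^;]*;){%d}[^;]*).*", as.integer(n - 1))
  sub(pattern, "\\1", x, perl = TRUE)
}

a <- c("id1234|a; b; c; d",
       "id5678|a; b; e; f",
       "id9012|a; g; h; i",
       "id3456|a; j; k; l")

upto_nth_semicolon(a, 2)
## => [1] "a; b" "a; b" "a; g" "a; j"
upto_nth_semicolon(a, 3)
## => [1] "a; b; c" "a; b; e" "a; g; h" "a; j; k"
```

The inner group is repeated n - 1 times because the text up to the n-th semicolon contains n - 1 semicolons; with n = 2 this reduces to the pattern shown above.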

If you want to extract the text rather than replace around it, you should use a regex extraction method, e.g.:

a <- c("id1234|a; b; c; d", "id5678|a; b; e; f", "id9012|a; g; h; i", "id3456|a; j; k; l")
unlist(regmatches(a, gregexpr("(?<=\\|)[^;]*;[^;]*", a, perl=TRUE)))
## => [1] "a; b" "a; b" "a; g" "a; j"

library(stringr)
str_extract(a, "(?<=\\|)[^;]*;[^;]*")
## => [1] "a; b" "a; b" "a; g" "a; j"
