How to extract a select set of characters between delimiters

I have text data like this, in which the strings are grouped by delimiters and some words have digits attached:

reps <- c("<#> <rep> From the <{1> <[1> 1nEw 2ROyal% <,> </[1> you can see that 1fOUntain in the Dunville 2PArk% <{2> <[2> so-it-is%@ </[2> </rep>",
        "<#> <rep> <[1> That 's 2right </rep> <#> something else <rep> I 1went on my 3Own </[1> </{1> </rep>",
        "<#> <exp> <[2> Oh* 2absolUtely% </[2> </{2> </exp> <#> <rep> I know <{1> <[1> every inch </[1> and% <,> <{2> <[2> 1Every nook and 2crAnny of it% </[2> </rep>")

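I need to extract the digits that occur inside the <rep> ... </rep> delimiters. As an added difficulty, there are other digits, each preceded by { or [, which I do not want to extract.

The expected output is this: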

"1212" "2" "13" "12"

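Getting rid of the unwanted digits is easy, namely with an embedded gsub substitution, but restricting the extraction to digits between the <rep> ... </rep> delimiters is much harder. My hunch is that lookbehind and lookahead will be part of the solution, but I am not clear on how to implement them. This is what I have tried, and it does not work properly: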

library(stringr)
str_extract_all(gsub("(?<=\\{|\\[)\\d", "", reps, perl = TRUE), "(?<=<rep>)(?!</rep>).*\\d.*?(?=</rep>)")
[[1]]
[1] " From the <{> <[> 1nEw 2ROyal% <,> </[> you can see that 1fOUntain in the Dunville 2PArk% <{> <[> so-it-is%@ </[> "

[[2]]
[1] " <[> That 's 2right </rep> <#> something else <rep> I 1went on my 3Own </[> </{> "

[[3]]
[1] " I know <{> <[> every inch </[> and% <,> <{> <[> 1Every nook and 2crAnny of it% </[> "

Any insights?

EDIT:

A stringr solution derived from @GK's answer:

gsub("\\D", "", unlist(lapply(gsub("(?<=\\{|\\[)\\d", "", reps, perl = TRUE), function(x) str_extract_all(x, "<rep>.*?</rep>"))))
[1] "1212" "2"    "13"   "12"

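You can first use gsub to remove the digits you are not interested in, i.e. those preceded by [ or {. Then use gregexpr and regmatches to extract the parts between <rep> and </rep>, and finally use gsub again to delete everything that is not a digit.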

x <- gsub("(\\{|\\[)\\d+", "", reps)
unlist(lapply(regmatches(x, gregexpr("<rep>.*?</rep>", x)),
  gsub, pattern="\\D", replacement=""))
#[1] "1212" "2"    "13"   "12"
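
For anyone who wants to see the three steps separately, here is a minimal base R sketch of the same idea wrapped in a function. The function name extract_rep_digits and the short sample_text vector are made up for illustration and are not part of the original data. Unlike the one-liner above, this version keeps the bracket and removes only the digits after it, which makes no difference to the final result.

```r
# Hypothetical helper: the name and the sample input are illustrative only.
extract_rep_digits <- function(x) {
  # Step 1: remove digits that directly follow "{" or "[",
  # keeping the bracket itself (e.g. "<{1>" becomes "<{>").
  cleaned <- gsub("(\\{|\\[)\\d+", "\\1", x)

  # Step 2: pull out every <rep> ... </rep> block. The lazy quantifier
  # ".*?" stops at the first closing tag, so two blocks in one string
  # are returned as two separate matches.
  blocks <- regmatches(cleaned, gregexpr("<rep>.*?</rep>", cleaned))

  # Step 3: delete everything that is not a digit in each block and
  # flatten the list; strings without a <rep> block contribute nothing.
  unlist(lapply(blocks, gsub, pattern = "\\D", replacement = ""))
}

# Made-up sample: two <rep> blocks in the first string, none in the second.
sample_text <- c("<rep> a <{1> 1b 2c </rep> 9z <rep> 3d </rep>",
                 "<exp> 4e </exp>")
result <- extract_rep_digits(sample_text)
print(result)
```

The "9" in the sample sits between the two blocks and is skipped only because the match is lazy. With a greedy ".*" the whole span from the first <rep> to the last </rep> would be taken as one match, which appears to be the reason the attempt in the question returned entire strings.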