How to cut a substring on several occurrences of a pattern?

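After a thorough search on Google and SO, I could not find a question matching my specific problem among the huge number of regex questions.

I want to parse a string in order to replace some of its substrings.

However, my case is a bit more complex than a simple str_replace, so I need a structured version of the string.

For instance, take value="There is __obj1__ and also __obj2__ in the house." with the pattern __.*?__

I would like to get something like c("There is ", "obj1", "and also", "obj2", "in the house"), so that I can act on all the even indices.

Here is where I am so far. I am struggling with regex greediness: each attempt captures either too much or not enough. The matrix return type is not really a problem, since I can unlist(x[[1]][-1]) it.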

library(tidyverse)
value="There is __obj1__ and also __obj2__ in the house."
str_match_all(value, "(.*?)__(.*?)__(.*?)") #too greedy at the very end
#> [[1]]
#>      [,1]                 [,2]         [,3]   [,4]
#> [1,] "There is __obj1__"  "There is "  "obj1" ""  
#> [2,] " and also __obj2__" " and also " "obj2" ""
str_match_all(value, "(.*)__(.*?)__(.*?)") #not greedy enough
#> [[1]]
#>      [,1]                                  [,2]                          [,3]  
#> [1,] "There is __obj1__ and also __obj2__" "There is __obj1__ and also " "obj2"
#>      [,4]
#> [1,] ""
str_match_all(value, "(.*?)__(.*)__(.*?)") #not greedy enough
#> [[1]]
#>      [,1]                                  [,2]        [,3]                    
#> [1,] "There is __obj1__ and also __obj2__" "There is " "obj1__ and also __obj2"
#>      [,4]
#> [1,] ""
str_match_all(value, "(.*?)__(.*?)__(.*)") #not greedy enough
#> [[1]]
#>      [,1]                                                [,2]        [,3]  
#> [1,] "There is __obj1__ and also __obj2__ in the house." "There is " "obj1"
#>      [,4]                              
#> [1,] " and also __obj2__ in the house."

Created on 2021-01-19 by the reprex package (v0.3.0)

You can use

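library(stringr)
value <- "There is __obj1__ and also __obj2__ in the house."
# Backslashes must be doubled inside an R string literal ("\\s", not "\s").
result <- str_match_all(value, "\\s*(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))")
# Drop column 1 (the full match); drop = FALSE keeps a matrix even for a single match.
result <- lapply(result, function(x) x[, -1, drop = FALSE])
result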

Output:

[[1]]
     [,1]        [,2]   [,3]            
[1,] "There is " "obj1" " and also"     
[2,] ""          "obj2" " in the house."
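To get from this matrix to the flat vector asked for in the question, one option is to act on the second column (the text between the `__` delimiters) and then read the matrix row by row. This is only a sketch: `toupper()` stands in for whatever replacement is actually needed, and dropping empty pieces is an assumption that suits this input.

```r
library(stringr)

value <- "There is __obj1__ and also __obj2__ in the house."
pattern <- "\\s*(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))"

# One row per match; drop column 1 (the full match) and keep the three groups.
groups <- str_match_all(value, pattern)[[1]][, -1, drop = FALSE]

# Placeholder action on the delimited parts (column 2 holds obj1, obj2, ...).
groups[, 2] <- toupper(groups[, 2])

# Transposing first makes as.vector() read the matrix row by row,
# so the pieces come out in their original order.
pieces <- as.vector(t(groups))

# Assumption: empty groups carry no information here and can be discarded.
pieces <- pieces[nzchar(pieces)]
pieces
```

Note that the whitespace immediately before each `__` is consumed by the `\s*` parts of the pattern, so it does not appear in the pieces.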

The pattern is

\s*(.*?)__(.*?)__(.*?)(?=\s*(?:__|$))

Note that you can even make \s* possessive, i.e. \s*+, to speed up matching.
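As a quick check, the possessive form can be compared with the original on the sample string. The sketch below makes only the leading `\s*` possessive, which is one reading of the remark above; any speed gain is not measured here.

```r
library(stringr)

value <- "There is __obj1__ and also __obj2__ in the house."

greedy_ws     <- "\\s*(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))"
# Same pattern, with the leading whitespace quantifier made possessive (\s*+).
possessive_ws <- "\\s*+(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))"

# Both patterns should yield the same match matrix for this input.
same_result <- identical(
  str_match_all(value, greedy_ws),
  str_match_all(value, possessive_ws)
)
same_result
```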

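Details:

  • \s* - zero or more whitespace characters
  • (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
  • __ - a literal __ substring
  • (.*?) - Group 2: any zero or more characters other than line break characters, as few as possible
  • __ - a literal __ substring
  • (.*?) - Group 3: any zero or more characters other than line break characters, as few as possible
  • (?=\s*(?:__|$)) - a positive lookahead that requires zero or more whitespace characters followed by either __ or the end of the string immediately to the right of the current location.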
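The role of the lookahead can be seen by running the question's first attempt with and without it. In the sketch below the leading `\s*` is left out on purpose, so the only difference between the two patterns is the lookahead; the variable names are illustrative.

```r
library(stringr)

value <- "There is __obj1__ and also __obj2__ in the house."

# Without the lookahead, the trailing lazy group is free to match nothing.
without_lookahead <- str_match_all(value, "(.*?)__(.*?)__(.*?)")[[1]]

# With the lookahead, the trailing group must extend up to the next __ or the end.
with_lookahead <- str_match_all(value, "(.*?)__(.*?)__(.*?)(?=\\s*(?:__|$))")[[1]]

# Column 4 is the third capture group in each result.
without_lookahead[, 4]
with_lookahead[, 4]
```

A lazy group at the end of a pattern stops as early as it can, so it needs something after it, here the lookahead, to force it to capture the trailing text.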