Remove first 4 words after a certain string pattern in R?

I'm working with very long strings. How can I remove the first 4 words that follow a specific string pattern? For example:

string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise." 

#remove the first 4 words after and including "Whosebug"

result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise." 

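Search for your pattern followed by whitespace and words. Find the start and end positions of the first match, split the string there, and paste the two pieces back together. Finally, use gsub to collapse any double (or longer) runs of spaces.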

string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise." 

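pat <- "Whosebug"

library(stringr)
# locate the pattern followed by 4 words (backslashes must be doubled in R strings)
tmp <- str_locate(
  string,
  paste0(
    pat,
    paste0(
      rep("\\s?[a-zA-Z]+", 4),
      collapse = ""
    )
  )
)

# keep the text before and after the match, then collapse repeated whitespace
gsub("\\s{2,}", " ",
  paste0(
    substring(string, 1, tmp[1] - 1),
    substring(string, tmp[2] + 1)
  )
)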

[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

A quick answer; I'm sure you can come up with better code than this:

string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
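# split into words; strsplit() keeps every word as plain character data
words <- strsplit(string, " ", fixed = TRUE)[[1]]
string2 <- ""
j <- 0  # position of the pattern word (0 = not found yet)
for (i in seq_along(words)) {
  if (words[i] == "Whosebug") {
    j <- i  # remember where the pattern is and drop it
  } else if (j > 0 && i - j <= 4) {
    next    # drop the 4 words that follow the pattern
  } else if (nzchar(string2)) {
    string2 <- paste(string2, words[i])
  } else {
    string2 <- words[i]
  }
}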
print(string2)

A base R solution

A one-line solution:

pattern <- "Whosebug"
string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise." 

gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

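How it works

Use a regular expression to build the pattern you want: "Whosebug" followed by 4 words. See ?regex for more information.

Words are identified by \w+ and separators by \W+ (capital W; it covers spaces and special characters, such as the apostrophe in a sentence). Inside an R string each backslash is doubled, so these are written \\w+ and \\W+.

(...){0,4} means the separator-plus-word combination can repeat up to 4 times.

\W* is needed to capture a possible final separator, so that the two remaining parts of the sentence are not joined by two separators. Try it and you'll see what I mean.

gsub finds the pattern you want and replaces it with "" (thereby deleting it).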
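
To make the role of the trailing \W* concrete, here is a small sketch that compares the pattern with and without it. The shortened sentence and the names rx_full, rx_short, matched, with_sep and without_sep are illustrative only and not part of the original answer.

```r
pattern <- "Whosebug"
string <- "hello I am a user of Whosebug and I am really happy with all the help."

# Full pattern: the match also swallows the separator after the 4th word.
rx_full <- paste0(pattern, "(\\W+\\w+){0,4}\\W*")
matched <- regmatches(string, regexpr(rx_full, string))
with_sep <- gsub(rx_full, "", string)

# Same pattern without the final \\W*: the space after "really" survives,
# so the two remaining parts end up separated by two spaces.
rx_short <- paste0(pattern, "(\\W+\\w+){0,4}")
without_sep <- gsub(rx_short, "", string)

print(matched)
print(with_sep)
print(without_sep)
```

Printing without_sep shows the doubled space between "of" and "happy", which is what the final \W* prevents.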


Handling edge cases

Note that it also works in these special cases:

# end of a sentence with fewer than 4 words after
string <- "hello I am a user of Whosebug and I am" 
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "

# beginning of a sentence
string <- "Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise." 
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."

# pattern == string
string <- "Whosebug and I am really" 
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
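
One more case worth knowing about: because the apostrophe counts as a separator, a contraction such as "I'm" is counted as two words. The sketch below wraps the one-liner in a hypothetical helper, remove_after (the name and its n argument are assumptions, not from the answer), and assumes the pattern contains no regex metacharacters.

```r
# Hypothetical helper: remove `pattern` plus up to `n` following words.
# Assumes `pattern` has no regex metacharacters that would need escaping.
remove_after <- function(string, pattern, n = 4) {
  rx <- paste0(pattern, "(\\W+\\w+){0,", n, "}\\W*")
  gsub(rx, "", string)
}

res_default <- remove_after("a b Whosebug c d e f g h", "Whosebug")
res_two <- remove_after("a b Whosebug c d e f g h", "Whosebug", n = 2)

# "I'm" is split at the apostrophe, so it uses up two of the four words:
# "I", "m", "very" and "happy" are removed, and "today" is kept.
res_apostrophe <- remove_after("x Whosebug I'm very happy today ok", "Whosebug")

print(res_default)
print(res_two)
print(res_apostrophe)
```

If contractions should count as a single word, the word and separator classes would need to be adjusted; that is left out of this sketch.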

A tidyverse solution

library(stringr)

# locate start and end position of pattern
# (string and pattern as defined in the first base R example)
tmp <- str_locate(string, paste0(pattern, "(\\W+\\w+){0,4}\\W*"))

# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)

# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])

# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
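
If only the first match needs to be dropped, the locate/invert/paste steps can probably be shortened with stringr's str_remove, which deletes the first match of a regex. This is a sketch of an alternative, not the answer's own method; the shortened sentence and the variable names are illustrative.

```r
library(stringr)

pattern <- "Whosebug"
string <- "hello I am a user of Whosebug and I am really happy with all the help."

# str_remove() deletes the first match only; str_remove_all() would delete every match.
result <- str_remove(string, paste0(pattern, "(\\W+\\w+){0,4}\\W*"))
print(result)
```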