Remove first 4 words after a certain string pattern in R?
I am working with very long strings. How can I remove the first 4 words that follow a specific string pattern? For example:
string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# remove "Whosebug" and the first 4 words after it
# expected result:
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
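Search for your pattern followed by four space-plus-word groups. Locate the start and end positions of the match, split the string there, and paste the two pieces back together. Finally, use gsub to collapse any run of two or more spaces.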
string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
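pat <- "Whosebug"
library(stringr)
# locate the pattern followed by 4 words (each optionally preceded by a space)
tmp <- str_locate(
  string,
  paste0(pat, paste0(rep("\\s?[a-zA-Z]+", 4), collapse = ""))
)
# cut the match out, then collapse runs of 2+ spaces into a single space
gsub("\\s{2,}", " ",
  paste0(
    substring(string, 1, tmp[1] - 1),
    substring(string, tmp[2] + 1)
  )
)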
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
A quick answer; I'm sure better code than this is possible:
string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
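# strsplit() gives a plain character vector of words, which is simpler to
# index than the one-row data frame returned by read.table()
words <- strsplit(string, " ", fixed = TRUE)[[1]]
kept <- character(0)
j <- 0  # position of the pattern word (0 = not seen yet)
for (i in seq_along(words)) {
  if (words[i] == "Whosebug") {
    j <- i  # drop the pattern word itself
  } else if (j == 0 || i - j > 4) {
    # keep words before the pattern and words after the 4 skipped ones
    kept <- c(kept, words[i])
  }
}
string2 <- paste(kept, collapse = " ")
print(string2)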
A base R solution
A one-line solution:
pattern <- "Whosebug"
string <- "hello I am a user of Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
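How it works
Use a regular expression to build the pattern you want: "Whosebug" followed by up to 4 words. See ?regex for more information.
Words are matched by \w+ and separators by \W+ (capital W; it covers spaces and special characters such as the apostrophe in a sentence).
(...){0,4} means the separator-plus-word combination may repeat up to 4 times.
\W* consumes a possible trailing separator, so that the two remaining parts of the sentence do not end up joined by two separators. Try it without \W* and you will see what I mean.
gsub finds the pattern and replaces it with "" (thereby deleting it).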
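As a minimal sketch of the explanation above, the regular expression can be wrapped in a small function. The name remove_after, its argument n, and the sample sentence are hypothetical and not part of the original answer; the last call shows what happens when the trailing \W* is left out.

```r
# Hypothetical helper (not from the original answer): delete `pattern`
# plus up to `n` following words from `string`.
remove_after <- function(string, pattern, n = 4) {
  # (\W+\w+){0,n} : up to n "separator + word" pairs
  # \W*           : one possible trailing separator
  regex <- paste0(pattern, "(\\W+\\w+){0,", n, "}\\W*")
  gsub(regex, "", string)
}

x <- "one two KEY a b c d three four"
remove_after(x, "KEY")      # "one two three four"
remove_after(x, "KEY", 2)   # "one two c d three four"

# Without the trailing \W*, the separators on both sides of the match survive:
gsub("KEY(\\W+\\w+){0,4}", "", x)   # "one two  three four" (two spaces)
```

Note that pattern is pasted into the regular expression as-is, so any regex metacharacters it contains would be interpreted as such.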
Handling edge cases
Note that it also works in these special cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of Whosebug and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "Whosebug and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "Whosebug and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
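One further case may be worth checking. Because \w does not match an apostrophe, a contraction such as I'm is counted as two words. The sentence below is made up for illustration, and the \s+\S+ variant is only an assumption about what "word" should mean (any run of non-space characters), not something the original answer proposes.

```r
pattern <- "Whosebug"
s <- "I like Whosebug I'm not sure what to do"

# \W+\w+ counts "I", "m", "not", "sure" as the 4 words
by_word_chars <- gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", s)
by_word_chars   # "I like what to do"

# Assumed alternative: treat any run of non-space characters as one word,
# so "I'm", "not", "sure", "what" are the 4 words
by_spaces <- gsub(paste0(pattern, "(\\s+\\S+){0,4}\\s*"), "", s)
by_spaces       # "I like to do"
```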
A tidyverse solution
library(stringr)
# locate start and end position of the match
# (assumes pattern and the full example string from the one-line solution)
tmp <- str_locate(string, paste0(pattern, "(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."