R: Remove words with 3 or more repeating letters using gsub
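I need to use gsub to remove words that contain 3 or more repeated letters from a string. Example:
"It has been raining verrrry badly heeere last few days"
Using the gsub function, I need to get the following:
"It has been raining badly last few days". The words 'verrrry' and 'heeere' have been removed from the string.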
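This gives what looks like the output string you want (it collapses the repeated letters rather than dropping the words):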
origStr = "It has been raining verrrry badly heeere last few days"
newStr <- gsub("e{3,}","e", origStr ) # replaces e's greater than 2 repeat
(newStr <- gsub("r{3,}","r", newStr )) # replaces r's greater than 2 repeat
# [1] "It has been raining very badly here last few days"
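If the goal is to drop the whole word, as the question asks, a single gsub call with a backreference may be enough. The following is a minimal sketch, not taken from any of the answers: the variable names `pattern` and `cleaned` are illustrative, and it assumes that "letters" can be approximated by `\w` (which also matches digits and underscores) and that matching is case-sensitive.

```r
# Sketch: remove every word that contains a character repeated 3+ times in a row.
x <- "It has been raining verrrry badly heeere last few days"

# \\s*          optional whitespace before the word (removed together with it)
# \\b\\w*       word boundary, then any leading word characters
# (\\w)\\1{2,}  one word character followed by at least 2 copies of itself
# \\w*\\b       the rest of the word up to the next boundary
pattern <- "\\s*\\b\\w*(\\w)\\1{2,}\\w*\\b"

# trimws() removes the leading space left behind if the first word is dropped
cleaned <- trimws(gsub(pattern, "", x, perl = TRUE))
print(cleaned)
# [1] "It has been raining badly last few days"
```

Because the preceding whitespace is consumed with the word, adjacent offending words and a word at the end of the string are handled without leaving double spaces.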
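Here is one approach:
library(tm)
data("acq")
sometext <- acq[[12]]$content
q <- tm::MC_tokenizer(x = sometext)
q[131] <- "eeee"  # plant a test word containing a repeated letter
# one column per letter: TRUE where a token has that letter 3+ times in a row
zz <- sapply(letters, function(x) grepl(paste0(x, "{3,}"), q, ignore.case = TRUE))
flag <- rowSums(zz)   # how many different letters repeat within each token
newq <- q[flag == 0]  # keep only tokens with no repeated letter
final <- paste(newq, collapse = " ")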
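The same per-letter idea can be tried on a small hand-made vector, without the tm package or its corpus. This is only a sketch with made-up words (`words`, `hits`, `n_runs` and `kept` are illustrative names). It counts how many different letters repeat in each word and keeps the words where that count is zero, since a word such as "sooorrry" has runs of two different letters.

```r
# Sketch: flag words that contain any letter 3+ times in a row, one regex per letter.
words <- c("It", "rained", "verrrry", "badly", "sooorrry", "days")

# Logical matrix: one row per word, one column per letter of the alphabet
hits <- sapply(letters, function(l) grepl(paste0(l, "{3,}"), words, ignore.case = TRUE))

# Number of distinct letters that form a run of 3+ in each word
n_runs <- rowSums(hits)

# Keep only the words without any such run
kept <- words[n_runs == 0]
print(kept)
# [1] "It"     "rained" "badly"  "days"
```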
One possible solution is to first build a regular expression for your case.
regExp <- paste(sapply(letters, paste, "{3,}", sep = ""), collapse = "|")
regExp
# [1] "a{3,}|b{3,}|c{3,}|d{3,}|e{3,}|f{3,}|g{3,}|h{3,}|i{3,}|j{3,}|k{3,}|l{3,}|m{3,}|n{3,}|o{3,}|p{3,}|q{3,}|r{3,}|s{3,}|t{3,}|u{3,}|v{3,}|w{3,}|x{3,}|y{3,}|z{3,}"
words <- unlist(strsplit(origStr, "\\s+"))
cleanStr <- paste(words[!grepl(regExp, words)], collapse = " ")
cleanStr
# [1] "It has been raining badly last few days"
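The 26-way alternation can probably be shortened with a backreference. The sketch below is an illustration rather than part of the answer: `longPattern` and `shortPattern` are hypothetical names, and it only checks that both patterns flag the same words in the example sentence (lowercase letters only).

```r
# Sketch: compare the per-letter alternation with a single backreference pattern.
words <- unlist(strsplit("It has been raining verrrry badly heeere last few days", "\\s+"))

# Same expression as the alternation built from `letters`
longPattern <- paste0(letters, "{3,}", collapse = "|")

# One lowercase letter followed by at least two more copies of itself
shortPattern <- "([a-z])\\1{2,}"

same <- identical(grepl(longPattern, words),
                  grepl(shortPattern, words, perl = TRUE))
print(same)
# [1] TRUE
```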
Option 1
x <- "It has been raining verrrry badly heeere last few days"
m <- gregexpr('\\s\\b\\w*(\\w)\\1{2,}\\w*\\b\\s', x, perl = TRUE)
regmatches(x, m) <- ' '
x
# [1] "It has been raining badly last few days"
Option 2
x <- "It has been raining verrrry badly heeere last few days"
sp <- strsplit(x, ' ')[[1]]
s <- sp[!sapply(sp, function(y) any(rle(strsplit(y, '')[[1]])$lengths >= 3))]
paste(s, collapse = ' ')
# [1] "It has been raining badly last few days"
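To see why the rle-based filter works, it can help to look at what rle returns for a single word. This is a small illustrative sketch (`runs` is a made-up name): the word is split into characters, and rle reports each run of identical characters together with its length.

```r
# Sketch: run-length encoding of the characters of one word.
runs <- rle(strsplit("verrrry", "")[[1]])

print(runs$values)
# [1] "v" "e" "r" "y"
print(runs$lengths)
# [1] 1 1 4 1

# The word is dropped when any run is 3 or longer
print(any(runs$lengths >= 3))
# [1] TRUE
```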
A quick search turned up this SO answer. Using it, you can do something like this:
## string with repeated letters
s <- "It has been raining verrrry badly heeere last few days"
## split string into vector of words to select
svec <- unlist(strsplit(s, " "))
## find words with 3 or more repeated letters/numbers
## (for any general symbol use '.' instead of '\\w')
rmword <- grep("(\\w)\\1{2,}", svec)
## join words into single string again, removing the unwanted ones
paste(svec[-rmword], collapse = " ")
## output:
## [1] "It has been raining badly last few days"
EDIT in response to the follow-up request. Also changed grep to grepl.
First, let's wrap the code in a function:
rm.repeatLetters <- function(x){
  xvec <- unlist(strsplit(x, " "))
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  return(paste(xvec[!rmword], collapse = " "))
}
Then use it on a data frame:
df <- data.frame(id=c(1, 2, 3), text=c(s, s, s), stringsAsFactors=FALSE)
## > df
##   id                                                   text
## 1  1 It has been raining verrrry badly heeere last few days
## 2  2 It has been raining verrrry badly heeere last few days
## 3  3 It has been raining verrrry badly heeere last few days
df$text <- sapply(df$text, rm.repeatLetters)
## > df
##   id                                    text
## 1  1 It has been raining badly last few days
## 2  2 It has been raining badly last few days
## 3  3 It has been raining badly last few days
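The switch from grep to grepl mentioned in the EDIT matters when a string contains no offending words. The sketch below uses made-up input (`svec`, `idx`, `keep` and the two result names are illustrative) to show the difference: negative indexing with an empty index vector selects nothing, whereas a logical mask keeps every word.

```r
# Sketch: grep vs grepl when no word has a character repeated 3+ times.
svec <- c("no", "offending", "words")

# grep returns positions; with no match this is integer(0)
idx <- grep("(\\w)\\1{2,}", svec, perl = TRUE)
withGrep <- paste(svec[-idx], collapse = " ")    # svec[-integer(0)] is empty

# grepl returns a logical vector, so negating it keeps all words
keep <- !grepl("(\\w)\\1{2,}", svec, perl = TRUE)
withGrepl <- paste(svec[keep], collapse = " ")

print(withGrep)
# [1] ""
print(withGrepl)
# [1] "no offending words"
```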