R: Remove words with 3 or more repeating letters using gsub

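I need to use gsub to remove words containing 3 or more repeated letters from a string. Example:

"It has been raining verrrry badly heeere last few days"

Using gsub, I need to get the following:

"It has been raining badly last few days". The words 'verrrry' and 'heeere' have been removed from the string.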

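This gets close to the output string you want, although it collapses each run of repeated letters instead of removing the word: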

origStr <- "It has been raining verrrry badly heeere last few days"

newStr <- gsub("e{3,}", "e", origStr)    # collapse runs of 3 or more e's into one e
(newStr <- gsub("r{3,}", "r", newStr))   # collapse runs of 3 or more r's into one r

# [1] "It has been raining very badly here last few days"
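If the words themselves should be dropped rather than repaired, one possible sketch is a single gsub call with a backreference. The pattern and the names `pattern` and `cleaned` are illustrative and not part of the answer above; `perl = TRUE` is assumed so that PCRE syntax applies. Note that `\\w` also matches digits and underscores.

```r
origStr <- "It has been raining verrrry badly heeere last few days"

# \\w*(\\w)\\1{2,}\\w* matches a whole word that contains some character
# repeated 3 or more times in a row; the leading \\s* also consumes the
# whitespace in front of that word.
pattern <- "\\s*\\b\\w*(\\w)\\1{2,}\\w*\\b"

# trimws() removes the leading space left behind when the first word is dropped
cleaned <- trimws(gsub(pattern, "", origStr, perl = TRUE))
cleaned
# [1] "It has been raining badly last few days"
```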

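Here is one approach:

library(tm)
data("acq")
sometext <- acq[[12]]$content          # sample document shipped with tm
q <- tm::MC_tokenizer(x = sometext)    # split the text into word tokens
q[131] <- "eeee"                       # inject a test word with a repeated letter

# one column per letter: TRUE where a token has 3+ of that letter in a row
zz <- sapply(letters, FUN = function(x) {
    grepl(paste0(x, "{3,}"), x = q, ignore.case = TRUE)
})

flag <- apply(X = zz, 1, sum)          # number of letters that repeat in each token
newq <- q[flag == 0]                   # keep only tokens with no repeated run
final <- paste(newq, collapse = " ")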

One possible solution is to first build a regular expression for your case.

regExp <- paste(sapply(letters, paste, "{3,}", sep = ""), collapse = "|")
regExp
# [1] "a{3,}|b{3,}|c{3,}|d{3,}|e{3,}|f{3,}|g{3,}|h{3,}|i{3,}|j{3,}|k{3,}|l{3,}|m{3,}|n{3,}|o{3,}|p{3,}|q{3,}|r{3,}|s{3,}|t{3,}|u{3,}|v{3,}|w{3,}|x{3,}|y{3,}|z{3,}"

# split on whitespace (the backslash must be doubled inside an R string)
words <- unlist(strsplit(origStr, "\\s+"))
cleanStr <- paste(words[!grepl(regExp, words)], collapse = " ")
cleanStr
# [1] "It has been raining badly last few days"
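The pattern built from `letters` only covers lowercase runs, so a word such as "SOOO" would slip through. A small sketch of one way around that, assuming case should be ignored; the sample string `mixed` is made up for illustration:

```r
regExp <- paste(sapply(letters, paste, "{3,}", sep = ""), collapse = "|")

mixed <- "That was SOOO good and verrrry tasty"   # hypothetical input
words <- unlist(strsplit(mixed, "\\s+"))

# ignore.case = TRUE lets a{3,} also match "AAA", "Aaa", and so on
keep <- !grepl(regExp, words, ignore.case = TRUE)
paste(words[keep], collapse = " ")
# [1] "That was good and tasty"
```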

Option 1

x <- "It has been raining verrrry badly heeere last few days"
m <- gregexpr('\\s\\b\\w*(\\w)\\1{2,}\\w*\\b\\s', x, perl = TRUE)
regmatches(x, m) <- ' '
x
# [1] "It has been raining badly last few days"

Option 2

x <- "It has been raining verrrry badly heeere last few days"
sp <- strsplit(x, ' ')[[1]]
s <- sp[!sapply(sp, function(y) any(rle(strsplit(y, '')[[1]])$lengths >= 3))]
paste(s, collapse = ' ')
# [1] "It has been raining badly last few days"
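To see why the `rle` test in Option 2 works, it can help to look at a single word. The following is a small worked example; the names `chars` and `runs` are only for illustration:

```r
chars <- strsplit("verrrry", "")[[1]]   # "v" "e" "r" "r" "r" "r" "y"
runs <- rle(chars)                      # run-length encoding of the characters

runs$values     # "v" "e" "r" "y"
runs$lengths    # 1 1 4 1

# any run of length 3 or more marks the word for removal
any(runs$lengths >= 3)
# [1] TRUE
```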

A quick search turned up this SO answer. Using it, you can do something like:

## string with repeated letters
s <- "It has been raining verrrry badly heeere last few days"

## split string into vector of words to select
svec <- unlist(strsplit(s, " "))

## find words with 3 or more repeated letters/numbers
## (for any general symbol use '.' instead of '\\w')
rmword <- grep("(\\w)\\1{2,}", svec)

## join words into single string again, removing the unwanted ones
paste(svec[-rmword], collapse = " ")

## output:
[1] "It has been raining badly last few days"
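One caveat with `svec[-rmword]`: when `grep` finds no match it returns `integer(0)`, and indexing with `-integer(0)` selects nothing, so the whole sentence would vanish. A logical mask from `grepl` avoids this. The sketch below uses a made-up sentence with no offending words to show the difference:

```r
clean <- unlist(strsplit("nothing to remove here", " "))   # hypothetical input

idx <- grep("(\\w)\\1{2,}", clean)      # integer(0): no word matches
paste(clean[-idx], collapse = " ")      # "" because clean[-integer(0)] is empty

mask <- grepl("(\\w)\\1{2,}", clean)    # FALSE FALSE FALSE FALSE
paste(clean[!mask], collapse = " ")     # "nothing to remove here"
```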

EDIT in response to a follow-up request. Also changed grep to grepl.

First, let's wrap the code in a function:

rm.repeatLetters <- function(x){
  xvec <- unlist(strsplit(x, " "))
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  return(paste(xvec[!rmword], collapse = " "))
}

Then use it on a data frame:

df <- data.frame(id=c(1, 2, 3), text=c(s, s, s), stringsAsFactors=FALSE)
## > df
##   id                                                   text
## 1  1 It has been raining verrrry badly heeere last few days
## 2  2 It has been raining verrrry badly heeere last few days
## 3  3 It has been raining verrrry badly heeere last few days


df$text <- sapply(df$text, rm.repeatLetters)
## > df
##   id                                    text
## 1  1 It has been raining badly last few days
## 2  2 It has been raining badly last few days
## 3  3 It has been raining badly last few days
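A side note on the last step: `sapply` on a character vector returns a vector named by its inputs, and its return type is not checked. As a hedged alternative, `vapply` with `USE.NAMES = FALSE` makes both explicit. The function is repeated here so the sketch runs on its own, and the second sample row is invented for illustration:

```r
rm.repeatLetters <- function(x) {
  xvec <- unlist(strsplit(x, " "))
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  paste(xvec[!rmword], collapse = " ")
}

df <- data.frame(id = c(1, 2),
                 text = c("It has been raining verrrry badly",
                          "nothing to remove here"),
                 stringsAsFactors = FALSE)

# character(1) asserts that every call returns exactly one string
df$text <- vapply(df$text, rm.repeatLetters, character(1), USE.NAMES = FALSE)
df$text
# [1] "It has been raining badly" "nothing to remove here"
```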