R: gsub with fixed=T or F and special cases

Building on two questions I asked earlier:

R: How to prevent memory overflow when using mgsub in vector mode?

gsub speed vs pattern length

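I really like @Tyler's suggestion to use fixed=TRUE, because it speeds up the computation significantly. However, it is not always applicable. For example, I need to replace caps only when it appears as a standalone word, with or without punctuation around it. It is not known a priori what may precede or follow the word, but it has to be some regular punctuation mark (, . ! - + etc.) or whitespace. It cannot be a digit or a letter. See the example below; capsule must stay as it is.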

i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"          

orig = "caps"
change = "cap"

gsub_FixedTrue <- function(i) {
  i = paste0(" ", i, " ")
  orig = paste0(" ", orig, " ")
  change = paste0(" ", change, " ")

  i = gsub(orig,change,i,fixed=TRUE)
  # Strip the single padding space added at each end
  i = gsub("^\\s|\\s$", "", i, perl=TRUE)

  return(i)
}

#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {

  # \\b is a word boundary; a single backslash ("\b") would be a literal backspace in an R string
  i = gsub(paste0("\\b",orig,"\\b"),change,i)

  return(i)
}

print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct

Results. The second output is the one I need:

[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
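
For what it's worth, the "no letter or digit on either side" rule can also be spelled out directly with lookarounds instead of `\\b`. This is only a minimal sketch, assuming `perl=TRUE` is acceptable here; the names `text`, `pattern` and `result` are mine and not part of the code above.

```r
text <- "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"

# Match "caps" only when no letter or digit sits directly before or after it.
# Lookarounds need the PCRE engine, hence perl = TRUE.
pattern <- "(?<![[:alnum:]])caps(?![[:alnum:]])"

result <- gsub(pattern, "cap", text, perl = TRUE)
print(result)
```

The one difference from `\\b` that I am aware of is the underscore: `\\b` counts `_` as a word character, so `caps_lock` is left alone by `\\bcaps\\b` but becomes `cap_lock` with the pattern above. Which behaviour is right depends on whether `_` should count as punctuation.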

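Using parts of your previous question for testing, I think we can wrap the punctuation in placeholders, as shown below, without slowing things down too much: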

line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
    "Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")


line <- rep(line, 1700000/length(line))

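## Surround every punctuation mark with <DEL> markers and spaces,
## so that a word next to punctuation becomes space-delimited
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)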

## Start    
line2 <- paste0(" ", line, " ")
e2 <-  paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")


for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}

## Remove the padding spaces and the <DEL> markers again
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
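
To check that the placeholder trick gives the same result as the `\\b` version, the steps above can be wrapped in a function and compared on the six sample lines (not the 1.7 million repeated ones). This is a rough sketch: `replace_words_fixed`, `replace_words_regex` and `sample_lines` are hypothetical names introduced here, and the `<DEL>` marker is assumed never to occur in the real text.

```r
sample_lines <- c("one", "two one", "four phones", "and a capsule",
                  "But here's a caps key",
                  "Here is the capsule, caps key, and two caps, or two caps. or even three caps-")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")

# Hypothetical wrapper around the placeholder steps shown above.
replace_words_fixed <- function(x, from, to) {
  # Isolate each punctuation mark so adjacent words become space-delimited.
  x <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", x, perl = TRUE)
  x <- paste0(" ", x, " ")
  for (k in seq_along(from)) {
    x <- gsub(paste0(" ", from[k], " "), paste0(" ", to[k], " "), x, fixed = TRUE)
  }
  # Undo the padding and drop the markers.
  gsub("^\\s|\\s$| <DEL>|<DEL> ", "", x, perl = TRUE)
}

# Reference implementation using word boundaries.
replace_words_regex <- function(x, from, to) {
  for (k in seq_along(from)) {
    x <- gsub(paste0("\\b", from[k], "\\b"), to[k], x, perl = TRUE)
  }
  x
}

fixed_result <- replace_words_fixed(sample_lines, e, r)
regex_result <- replace_words_regex(sample_lines, e, r)
print(identical(fixed_result, regex_result))
```

One caveat to keep in mind: because each fixed match consumes the space on both sides, the same word repeated back to back (for example `"caps caps"`) only has its first occurrence replaced by the fixed version, while the `\\b` version replaces both. If that can happen in your data, running the loop twice may be enough, but I have not timed that.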