gsub regex to work for both word boundaries and punctuation

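I am trying to search a sentence for words (case-insensitively) and for punctuation. The function below works for words, but it needs \\ to work for, e.g., a dot, so it gives unwanted results. See below: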

fun <- function(text, search) {
  gsub(paste0("\\b(", search, ")\\b"), paste0("<mark>", '\\1', "</mark>"),
       text, ignore.case = T)
}
> fun("this is a test.", ".")
[1] "this<mark> </mark>is<mark> </mark><mark>a</mark><mark> </mark>test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark></mark>"

Expected:

> fun("this is a test.", ".")
[1] "this is a test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark>)</mark>"

What is the best way (a regex?) to search a string for both words and punctuation?

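You need to:

  • Escape the search term, because it may contain regex metacharacters such as . or ).
  • Use dynamic adaptive word boundaries, which require a word boundary only on a side where the search term starts or ends with a word character.
  • Pass perl=TRUE to gsub, because dynamic adaptive word boundaries are lookarounds.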
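
To illustrate why plain \b is not enough for punctuation, here is a minimal sketch comparing it with adaptive boundaries on an already-escaped dot. The variable names (text, plain, adaptive) are only for illustration, and both calls use perl=TRUE so that the same regex engine is compared.

```r
text <- "this is a test."

# Plain \b on both sides of an escaped dot: under PCRE there is no word
# boundary between "." and the end of the string, so nothing is marked.
plain <- gsub("\\b(\\.)\\b", "<mark>\\1</mark>", text, perl = TRUE)

# Adaptive boundaries: (?!\B\w) only demands a boundary when the match starts
# with a word character, and (?<!\w\B) only when it ends with one.
adaptive <- gsub("(?!\\B\\w)(\\.)(?<!\\w\\B)", "<mark>\\1</mark>", text, perl = TRUE)

print(plain)     # "this is a test."
print(adaptive)  # "this is a test<mark>.</mark>"
```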

R code

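## Escaping function: backslash-escape regex metacharacters in the search term
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Wrap each match in <mark> tags, using adaptive word boundaries
fun <- function(text, search) {
  gsub(paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)"), "<mark>\\1</mark>",
       text, ignore.case = TRUE, perl=TRUE)
}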
fun("this is a test.", ".")
# [1] "this is a test<mark>.</mark>"

fun("(this is a test)", ")")
# [1] "(this is a test<mark>)</mark>"
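
The same approach should also cover whole words and terms that mix letters with punctuation. Below is a self-contained sketch of that; escape_regex and mark_term are hypothetical names chosen here so the block runs on its own, and they mirror the escaping and adaptive-boundary idea described above rather than any library API.

```r
# Hypothetical helper: backslash-escape regex metacharacters in a literal term.
escape_regex <- function(term) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", term)
}

# Hypothetical helper: wrap each standalone occurrence of `term` in <mark> tags.
mark_term <- function(text, term) {
  pattern <- paste0("(?!\\B\\w)(", escape_regex(term), ")(?<!\\w\\B)")
  gsub(pattern, "<mark>\\1</mark>", text, ignore.case = TRUE, perl = TRUE)
}

word_hit  <- mark_term("This is a test.", "is")     # the "is" inside "This" is skipped
case_hit  <- mark_term("this is a test.", "THIS")   # matching is case-insensitive
mixed_hit <- mark_term("this is a test.", "test.")  # word followed by punctuation

print(word_hit)   # "This <mark>is</mark> a test."
print(case_hit)   # "<mark>this</mark> is a test."
print(mixed_hit)  # "this is a <mark>test.</mark>"
```

Because the boundary checks adapt to the first and last character of the term, the same function can be called with a word, a punctuation mark, or a mix of both.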