gsub regex to work for both word boundaries and punctuation

Question

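I am trying to search for words (case-insensitively) and punctuation in a sentence. The function below works for words, but the search term needs \\ to work for e.g. a dot, and so it results in undesired behaviour, see below: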

fun <- function(text, search) {
  gsub(paste0("\\b(", search, ")\\b"), paste0("<mark>", '\\1', "</mark>"),
       text, ignore.case = T)
}
> fun("this is a test.", ".")
[1] "this<mark> </mark>is<mark> </mark><mark>a</mark><mark> </mark>test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark></mark>"

Expected:

> fun("this is a test.", ".")
[1] "this is a test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark>)</mark>"

What is the best way (a regex?) to search for both words and punctuation in a string?

Answer 1

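You need to escape the special regex characters in the search string and wrap it in dynamic adaptive word boundaries.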

Since dynamic adaptive word boundaries are lookarounds, you need to pass perl=TRUE to gsub.
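
As a rough illustration of what the two lookarounds do (the variable names and sample strings here are made up for this sketch): (?!\B\w) only demands a word boundary on the left when the search term starts with a word character, and (?<!\w\B) only demands one on the right when the term ends with a word character. A term that starts or ends with punctuation is matched wherever it occurs. R's default regex engine does not support lookarounds, hence perl = TRUE.

```r
# Leading and trailing adaptive boundaries, written as PCRE lookarounds
left  <- "(?!\\B\\w)"
right <- "(?<!\\w\\B)"

# Word term: acts like \btest\b, so "testing" and "contest" are skipped
word_res <- gsub(paste0(left, "(test)", right), "<\\1>",
                 "testing a test contest", perl = TRUE)
word_res
# [1] "testing a <test> contest"

# Punctuation term: no boundary is required around the escaped dot
dot_res <- gsub(paste0(left, "(\\.)", right), "<\\1>",
                "a test. 3.5", perl = TRUE)
dot_res
# [1] "a test<.> 3<.>5"
```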

See the R code:

## Escaping function
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
fun <- function(text, search) {
  gsub(paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)"), "<mark>\\1</mark>",
       text, ignore.case = TRUE, perl=TRUE)
}
fun("this is a test.", ".")
# [1] "this is a test<mark>.</mark>"

fun("(this is a test)", ")")
# [1] "(this is a test<mark>)</mark>"
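
A few extra checks, offered as a sketch rather than as part of the original answer: the two functions are repeated so the block runs on its own, and the sample inputs are invented for illustration. They show what the escaping step produces and that word terms are still matched as whole words, case-insensitively.

```r
# Repeated from the answer above so this block is self-contained
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
fun <- function(text, search) {
  gsub(paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)"), "<mark>\\1</mark>",
       text, ignore.case = TRUE, perl = TRUE)
}

# Special characters gain a leading backslash (writeLines shows the raw string)
writeLines(regex.escape("a.b(c)"))
# a\.b\(c\)

# Whole-word, case-insensitive match: "Testing" is left alone
fun("Testing this TEST.", "test")
# [1] "Testing this <mark>TEST</mark>."

# Term that ends in punctuation
fun("Is this a test? Yes.", "test?")
# [1] "Is this a <mark>test?</mark> Yes."
```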

regex

r