Clean string using gsub and multiple conditions

Question

I have already seen this one, but it is not what I need:

regex multiple pattern with singular replacement

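Situation: Using gsub, I want to clean up strings. These are my conditions:

Keep only words (no digits and no "weird" symbols)
Words joined by (only one) of ' - _ $ . are valid. For example: don't, re-loading, come_home, something$col
Keep specific names, such as package::function or package::function()

So, I have the following: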

[^A-Za-z]
([a-z]+)(-|'|_|$)([a-z]+)
([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*

Example:

If I have the following:

# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay

I want

Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay

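Questions: I have a couple:

a. The second expression does not work properly. Currently, it only works with - or '

b. How can I combine all of these into a single gsub in R? I would like to do something like gsub(myPatterns, myText), but I don't know how to fix and combine all of this.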

Answer 1

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

See the regex demo. Alternatively, to also replace runs of whitespace with a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
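
A side note on escaping: inside an ordinary R string literal every regex backslash has to be doubled. Assuming R 4.0.0 or later, a raw string literal is one way to write the pattern exactly as it appears in a regex tester. The sketch below (pat_escaped, pat_raw and cleaned are illustrative names only) checks that both spellings yield the same pattern:

```r
# Ordinary string literal: each regex backslash is written as "\\"
pat_escaped <- "(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]"

# Raw string literal (assumes R >= 4.0.0): backslashes are taken literally
pat_raw <- r"((?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F)|[^\p{L}\s])"

# Both literals describe the same sequence of characters
same_pattern <- identical(pat_escaped, pat_raw)

# The raw-string pattern can be passed to gsub() like any other string
cleaned <- trimws(gsub(pat_raw, "", "# Keep something$col or here_you::must_stay", perl = TRUE))
print(same_pattern)
print(cleaned)
```

Which spelling to use is a matter of taste; the doubled-backslash form also works on older R versions.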

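Details

(?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): matches either of two patterns:
- \w+::\w+(?:\(\))? - 1+ word characters, ::, 1+ word characters, and an optional () substring
- | - or
- \p{L}+ - one or more Unicode letters
- (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ followed by 1+ Unicode letters
(*SKIP)(*F) - discards the match and resumes the search from the position where that match ended
| - or
[^\p{L}\s] - any character other than a Unicode letter or whitespace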
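
To see the effect of (*SKIP)(*F) in isolation, here is a minimal sketch on a made-up string. It uses a simplified pattern that only protects hyphenated words; x, protect and plain are illustrative names, not part of the pattern above.

```r
# Made-up input: two hyphenated words plus punctuation to remove
x <- "re-loading, co-op!"

# Without the protecting branch, every non-letter, non-space is removed,
# including the hyphens inside the words
plain <- gsub("[^\\p{L}\\s]", "", x, perl = TRUE)

# With the protecting branch, hyphenated words are matched first and then
# skipped, so only the characters outside them can reach the second branch
protect <- gsub("\\p{L}+(?:-\\p{L}+)*(*SKIP)(*F)|[^\\p{L}\\s]", "", x, perl = TRUE)

print(plain)    # "reloading coop"
print(protect)  # "re-loading co-op"
```

The verbs are PCRE features, so perl=TRUE is required for this to work.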

See the R demo:

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"                                                  
[3] "Update href of toc anchors use instead"                                                   
[4] "Keep something$col or here_you::must_stay"
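
Regarding point a in the question: one likely reason the second expression misbehaves is that an unescaped $ inside an alternation group is the end-of-string anchor, not a literal dollar sign. A small check, assuming a few lower-case test words (the variable names are illustrative):

```r
words <- c("re-loading", "don't", "come_home", "something$col")

# Pattern from the question: the bare $ is an anchor here
asker_pattern <- "([a-z]+)(-|'|_|$)([a-z]+)"
# Same pattern with the dollar sign escaped
escaped_pattern <- "([a-z]+)(-|'|_|\\$)([a-z]+)"
# Inside a bracket expression $ is literal, so no escape is needed
class_pattern <- "[a-z]+[-'_$][a-z]+"

res_asker <- grepl(asker_pattern, words, perl = TRUE)
res_escaped <- grepl(escaped_pattern, words, perl = TRUE)
res_class <- grepl(class_pattern, words, perl = TRUE)

print(res_asker)    # TRUE TRUE TRUE FALSE
print(res_escaped)  # TRUE TRUE TRUE TRUE
print(res_class)    # TRUE TRUE TRUE TRUE
```

Note also that [a-z] alone does not match capitalised words such as Re-loading, which is one reason the pattern above uses \p{L}.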

Answer 2

Alternatively,

txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't", 
         "# Needs to handle NA for desc::desc_get()",
         "# Update href of toc anchors , use \"-\" instead \".\"", 
         "# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
            "Needs to handle NA for desc::desc_get()",
            "Update href of toc anchors use instead",
            "Keep something$col or here_you::must_stay")

leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE

Tags: regex, r, gsub