Clean string using gsub and multiple conditions

I have already seen this, but it is not what I need:


Situation: I want to clean strings using gsub. These are my conditions:

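  1. Keep only words (no numbers and no "weird" symbols)
  2. Keep words joined by (only one of) ' - _ $ . as a single word. For example: don't, re-loading, come_home, something$col
  3. Keep specific names, such as package::function or package::function()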

So, I have the following:

  1. [^A-Za-z]
  2. ([a-z]+)(-|'|_|$)([a-z]+)
  3. ([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*

Example:

If I have the following:

# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay

I want:

Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay

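Problems: I have several:

A. The second expression does not work properly. Currently, it only works for - and '.

B. How can I combine all of these into a single gsub in R? I want to do something like gsub(myPatterns, myText), but I don't know how to fix and combine all of this.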

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

See the regex demo. Alternatively, to also replace runs of whitespace with a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

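Details (the regex is shown with single backslashes; in an R string literal each backslash must be doubled)

  • (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): matches either of two patterns:
    • \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
    • | - or
    • \p{L}+ - one or more Unicode letters
    • (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ followed by 1+ Unicode letters
  • (*SKIP)(*F) - discards the text matched so far and resumes the search from the position where the match was failed
  • | - or
  • [^\p{L}\s] - any character other than a Unicode letter or whitespace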
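
The (*SKIP)(*F) idiom may be easier to see on a smaller case. The following is a minimal sketch with a made-up input (the string x and the simplified pattern are illustrative assumptions, not part of the question): anything shaped like pkg::fun is matched first and then deliberately failed, so only the colons outside such names are removed.

```r
# Hypothetical input, chosen only to illustrate the idiom
x <- "a:b base::paste c:d"

# Left alternative: protect pkg::fun by matching it and then forcing a failure
# Right alternative: the characters that should actually be deleted
res <- gsub("\\w+::\\w+(*SKIP)(*F)|:", "", x, perl = TRUE)
res
# [1] "ab base::paste cd"
```

The verbs (*SKIP) and (*F) are PCRE features, so perl = TRUE is required; the default regex engine in R does not understand them.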

See the R demo:

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"                                                  
[3] "Update href of toc anchors use instead"                                                   
[4] "Keep something$col or here_you::must_stay"    
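
As for question A, a likely reason the second pattern only handles - and ' is that an unescaped $ inside an alternation is an end-of-string anchor, whereas inside a bracket expression it is a literal dollar sign. The sketch below compares the pattern from the question with a bracket-expression variant; the vector words and the result names are illustrative assumptions.

```r
# Hypothetical test words taken from the examples in the question
words <- c("re-loading", "something$col")

# Pattern from the question: the bare $ is an anchor, not a dollar sign
anchor_hits <- grepl("([a-z]+)(-|'|_|$)([a-z]+)", words)

# Bracket expression: $ is a literal character here
class_hits <- grepl("[a-z]+[-'_$][a-z]+", words)

anchor_hits
# [1]  TRUE FALSE
class_hits
# [1] TRUE TRUE
```

Escaping the dollar sign as \\$ in the alternation would also work, but a bracket expression is shorter when every separator is a single character.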

Alternatively,

txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't", 
         "# Needs to handle NA for desc::desc_get()",
         "# Update href of toc anchors , use \"-\" instead \".\"", 
         "# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
            "Needs to handle NA for desc::desc_get()",
            "Update href of toc anchors use instead",
            "Keep something$col or here_you::must_stay")

leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE
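
The approach above keeps what the pattern matches and blanks out everything else. A minimal sketch of the same keep-only-the-matches idea on a toy string (x, m and kept are illustrative names, and pasting the matches together is a stand-in for the invert = TRUE assignment used above):

```r
# Hypothetical input: keep the lowercase runs, drop the digits
x <- "ab12cd"

# gregexpr() records the position and length of every match
m <- gregexpr("[a-z]+", x)

# regmatches() extracts the matched pieces; gluing them back together
# has the same effect as deleting the non-matching pieces
kept <- sapply(regmatches(x, m), paste, collapse = "")
kept
# [1] "abcd"
```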