Clean string using gsub and multiple conditions
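I have already seen this one, but it is not what I need:
- regex multiple pattern with singular replacement
Situation: Using gsub, I want to clean up strings. These are my conditions:
- Keep words only (no digits and no "weird" symbols)
- Treat words joined by one (and only one) of ' - _ $ . as a single word. For example: don't, re-loading, come_home, something$col
- Keep specific names, such as package::function or package::function()
So, I have the following: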
[^A-Za-z]
([a-z]+)(-|'|_|$)([a-z]+)
([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*
Examples:
If I have the following:
# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay
I would like to get:
Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay
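Questions: I have several:
a. The second expression does not work properly. Currently, it only works with - or '
b. How can I combine all of these into a single gsub in R? I would like to do something like gsub(myPatterns, myText), but I do not know how to fix and combine all of this.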
You can use
trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))
See the regex demo. Alternatively, to also collapse runs of whitespace into a single space, use
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
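Details
- (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F) - matches either of two patterns and then discards the match:
  - \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
  - | - or
  - \p{L}+ - one or more Unicode letters
  - (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ followed by 1+ Unicode letters
  - (*SKIP)(*F) - fails the match and resumes the search right after the text matched so far, so that text is left untouched
- | - or
- [^\p{L}\s] - any char other than a Unicode letter or whitespace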
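To make the effect of the (*SKIP)(*F) branch more concrete, here is a minimal sketch that runs the same pattern with and without the protected alternatives. The input string `x` and the variable names are made up for illustration and are not part of the question.

```r
# Hypothetical one-line input, not taken from the question.
x <- "# keep re-loading, drop 42 & pkg::fun()!"

# Branch 1: text to protect. (*SKIP)(*F) discards that match and
# makes the regex engine resume scanning right after it.
protect <- "(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)"

# Branch 2: anything that is neither a letter nor whitespace gets deleted.
pattern <- paste0(protect, "(*SKIP)(*F)|[^\\p{L}\\s]")

# With the protected branch: joined words and pkg::fun() survive.
removed_only <- gsub(pattern, "", x, perl = TRUE)
print(removed_only)

# Without it: every non-letter is stripped, including -, :: and ().
naive <- gsub("[^\\p{L}\\s]", "", x, perl = TRUE)
print(naive)
```

In the first result re-loading and pkg::fun() stay intact, while the naive version reduces them to reloading and pkgfun. Both results still contain leading and repeated spaces, which is why the answer wraps the call in trimws() and a second gsub().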
See the R demo:
myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Output:
[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"
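If the cleanup is needed in more than one place, the two steps could be wrapped in a small function. This is only a sketch: clean_comments() is a hypothetical helper name, not a function from any package, and it assumes the same rules as the pattern above.

```r
# clean_comments() is a hypothetical helper, shown only as an illustration.
clean_comments <- function(x) {
  # Text to protect: pkg::fun, pkg::fun(), and letter runs joined by - ' _ $
  keep <- "(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)"
  # Text to delete: anything that is neither a letter nor whitespace
  drop <- "[^\\p{L}\\s]"
  stripped <- gsub(paste0(keep, "|", drop), "", x, perl = TRUE)
  # Collapse the gaps left behind, then trim both ends
  trimws(gsub("\\s{2,}", " ", stripped))
}

cleaned <- clean_comments(c("# Keep something$col or here_you::must_stay",
                            "# Needs to handle NA for desc::desc_get()"))
print(cleaned)
```

Keeping the "keep" and "drop" parts in separate variables makes it easier to extend either side later, for example by adding . to the joining characters.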
Alternatively,
txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
"# Update href of toc anchors , use \"-\" instead \".\"",
"# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
"Needs to handle NA for desc::desc_get()",
"Update href of toc anchors use instead",
"Keep something$col or here_you::must_stay")
leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE