Remove all punctuation except underscores between characters in R with a POSIX character class

Question

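I would like to use R to remove all underscores except those between words. In other words, the code should remove underscores at the beginning or end of a word. The result should be 'hello_world and hello_world'. I want to use the predefined character classes. So far I have learned how to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.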

test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)

Answer 1

A non-regex approach is to split the string on spaces and use trimws with the whitespace argument set to _, i.e.

paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
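
The one-liner above collapses everything into a single string, so it suits one input at a time. A minimal sketch of the same split-and-trim idea for a character vector with several elements follows; trim_underscores is a hypothetical helper name, not something from the answer, and the whitespace argument of trimws needs a reasonably recent R (3.6.0 or later, as far as I recall).

```r
# Hypothetical helper (name is an assumption): trim leading/trailing
# underscores from every space-separated token of each input string.
trim_underscores <- function(x) {
  vapply(
    strsplit(x, " ", fixed = TRUE),
    # trimws() is vectorised over the tokens of one string
    function(tokens) paste(trimws(tokens, whitespace = "_"), collapse = " "),
    character(1)
  )
}

trim_underscores(c("hello_world and _hello_world_", "__a_b__ c"))
#[1] "hello_world and hello_world" "a_b c"
```

Using vapply with character(1) keeps one output string per input string, which the sapply/paste combination does not guarantee for longer vectors.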

Answer 2

You can use:

test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output

[1] "hello_world and hello_world"

Explanation of the regex:

(?<![^\W])  assert that what precedes is a non word character OR the start of the input
_            match an underscore to remove
|            OR
_            match an underscore to remove, followed by
(?![^\W])   assert that what follows is a non word character OR the end of the input
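
As a rough check of the explanation above: [^\W] matches the same characters as \w, so the lookarounds can presumably be written more directly. The sketch below compares both spellings (backslashes are doubled because they sit inside R string literals) and shows one caveat I would expect, namely that _ itself counts as a word character.

```r
test <- "hello_world and _hello_world_"

# Pattern from the answer, with backslashes doubled for the R string literal
original <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Alternative spelling (my assumption of equivalence): [^\W] is the same set as \w
simplified <- gsub("(?<!\\w)_|_(?!\\w)", "", test, perl = TRUE)

identical(original, simplified)
#[1] TRUE

# Caveat: "_" is a word character, so in a run of underscores only the
# outermost one is preceded (or followed) by a non-word position.
gsub("(?<!\\w)_|_(?!\\w)", "", "__hello", perl = TRUE)
#[1] "_hello"
```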

Answer 3

You can use

gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)

See the regex demo.

Details:

[^_[:^punct:]] - any punctuation character except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
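
The sample string in the question contains no punctuation other than underscores, so the first alternative never fires there. A small sketch with a made-up input (messy is my own example, not from the question) shows all three alternatives at work; backslashes are doubled for the R string literal.

```r
# Hypothetical input with a comma and an exclamation mark added
messy <- "hello_world, and _hello_world_!"

# [^_[:^punct:]] drops "," and "!"; \b_+ and _+\b drop the edge underscores
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", messy, perl = TRUE)
#[1] "hello_world and hello_world"
```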

Answer 4

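We can remove every underscore that has a word boundary on either side. We use positive lookahead and lookbehind assertions to find such underscores. To remove the underscores at the start and end of the string, we use trimws.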

test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
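
To make the two stages easier to follow, here is the same call split into steps. The intermediate variable name trimmed is my own, and the backslashes are doubled for the R string literal.

```r
test <- "hello_world and _hello_world_"

# Step 1: trimws() strips underscores only at the very start/end of the string
trimmed <- trimws(test, whitespace = "_")
trimmed
#[1] "hello_world and _hello_world"

# Step 2: remove the remaining underscores that touch a word boundary
gsub("(?<=\\b)_|_(?=\\b)", "", trimmed, perl = TRUE)
#[1] "hello_world and hello_world"
```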

