Regex gsub R differentiate between ellipsis and periods

text="stack overflow... is a popular website."

I want to separate punctuation from words. The output should be:

"stack overflow ... is a popular website . "

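Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE) returns:

"stack overflow . . . is a popular website . " because it does not distinguish between a period and an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.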

Try

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "

gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"

I think a non-lookaround approach will be more efficient and readable:

text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "

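I updated the post, since a space is needed both before and after the punctuation.

The [[:space:]]* on either side of (\.+) matches zero or more whitespace characters, and (\.+) matches one or more periods. The parentheses (...) form a capturing group whose value is stored in numbered buffer #1, which we can access with the \1 backreference in the replacement pattern (written "\\1" inside an R string). Each match is therefore replaced with the periods captured by the pattern, padded with a space on each side. Capturing is more efficient than using lookarounds, because there is no overhead of checking the text before/after the current position.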
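As a small sketch of the capture-and-backreference idea described above, the pattern can be wrapped in a helper and applied to a character vector. The function name pad_periods is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): surround each run of
# periods with exactly one space, swallowing any whitespace already there.
pad_periods <- function(x) {
  # (\\.+) captures the run of periods; "\\1" puts it back in the replacement
  gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", x)
}

res <- pad_periods(c("aaa...", "aaa...bbb", "one. two"))
print(res)
# [1] "aaa ... "    "aaa ... bbb" "one . two"
```

Because gsub is vectorised over its input, the same call handles one string or many.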

Now, if you need to handle all punctuation, use [[:punct:]]:

gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

R regex help:

[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Code demo:

text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "

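Update for hyphenated words

To avoid matching hyphenated words, you can match and skip any - that is surrounded by word boundaries:

text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)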
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
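To see what the \b-\b(*SKIP)(*F) branch contributes, here is a minimal comparison of the same pattern with and without it. The variable names are illustrative only, and the input is a shortened string rather than the one from the answer.

```r
x <- "stack-overflow."

# Without the skip branch, the hyphen is treated like any other punctuation.
without_skip <- gsub("\\s*(\\p{P}+)\\s*", " \\1 ", x, perl = TRUE)

# With the skip branch, a hyphen between two word characters is matched
# first and then discarded by (*SKIP)(*F), so it is never padded.
with_skip <- gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", x, perl = TRUE)

print(without_skip)  # [1] "stack - overflow . "
print(with_skip)     # [1] "stack-overflow . "
```

The (*SKIP)(*F) verbs are PCRE features, so this only works with perl = TRUE.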

After so many comments, this regex should fit your needs best:

(?:\b| )([.,:;!]+)(?: |\b)

To use it in R, the backslashes must be doubled.

So we end up with:

text<-c('Hi!stack-overflow... is a popular website, I visit it every day.',
    'aaa...',
    'AAA...B"B"B',
    'AA .BBB #unlikely to happen but managed anyway')

> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "                                                              
[3] "AAA ... B\"B\"B"                                                       
[4] "AA . BBB #unlikely to happen but managed anyway"
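The backslash doubling mentioned above is a property of R string literals, not of the regex itself. A short sketch of how one might check this follows; the raw-string form assumes R 4.0.0 or later, and the variable names are illustrative.

```r
# In an ordinary R string literal, "\\b" is two characters: a backslash and "b".
pattern <- "(?:\\b| )([.,:;!]+)(?: |\\b)"
cat(pattern, "\n")  # shows the regex with single backslashes, as the engine sees it

# Assumption: R >= 4.0.0, where raw strings allow the regex to be typed as-is.
raw_pattern <- r"{(?:\b| )([.,:;!]+)(?: |\b)}"

same <- identical(pattern, raw_pattern)
print(same)  # [1] TRUE
```

Either spelling can be passed to gsub unchanged, since both produce the same character string.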