Regex gsub in R: differentiate between an ellipsis and periods
text="stack overflow... is a popular website."
I want to separate punctuation marks from words. The output should be:
"stack overflow ... is a popular website . "
Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE)
returns:
"stack overflow . . . is a popular website . "
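because it does not distinguish between a period and an ellipsis (suspension points). In short, when three periods occur together in the text, R should treat them as a single punctuation mark.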
Try
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "
gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"
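The difference between the two lookaround patterns above is easy to miss, so here is a minimal sketch that runs both on the same input. The names p_end_only, p_general and add_spaces are made up for illustration and are not part of the original answer.

```r
# Pattern from the first two calls: a space before a run of periods, plus a
# space at the end of the string when the string ends with a period.
p_end_only <- "(?<=\\.)$|(?<=\\w)(?=\\.)"

# Pattern from the third call: additionally a space between a period and a
# following word character.
p_general <- "(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)"

# Hypothetical helper: insert a space at every zero-width match.
add_spaces <- function(x, pattern) gsub(pattern, " ", x, perl = TRUE)

add_spaces("aaa...bbb", p_end_only)  # "aaa ...bbb"  (no space after the dots)
add_spaces("aaa...bbb", p_general)   # "aaa ... bbb"
add_spaces("stack overflow... is a popular website.", p_general)
# "stack overflow ... is a popular website . "
```

Both patterns only look at word characters next to periods, so other punctuation is left untouched.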
I think a non-lookaround approach would be more efficient and readable:
text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "
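I updated the post because spaces are needed before and after the punctuation. The [[:space:]]* around (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. (...) forms a capturing group whose value is stored in numbered buffer #1, which we can access with the \\1 backreference in the replacement pattern. Thus, the whole match is replaced with the periods captured by the pattern, with a space on each side. Capturing is more efficient than using lookarounds because there is no overhead of checking the text before/after the current position.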
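To make the roles of the whole match and of capture group #1 concrete, the following sketch lists what the pattern actually matches and then what the replacement produces. The variable names and the "aaa  ...  bbb" input are assumptions added for illustration.

```r
text <- "stack overflow... is a popular website."
pattern <- "[[:space:]]*(\\.+)[[:space:]]*"

# Whole matches: each run of periods together with any whitespace touching it.
matches <- regmatches(text, gregexpr(pattern, text))[[1]]
# matches is c("... ", ".")

# The replacement keeps only the captured periods (\\1) and pads them.
result <- gsub(pattern, " \\1 ", text)
# result is "stack overflow ... is a popular website . "

# Surrounding whitespace is consumed by the match, so spaces do not pile up.
squeezed <- gsub(pattern, " \\1 ", "aaa  ...  bbb")
# squeezed is "aaa ... bbb"
```

Because the whitespace is part of the match rather than merely looked at, input that already has spaces around the periods ends up with exactly one space on each side.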
Now, if you need to handle all punctuation marks, use [[:punct:]]:
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "
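One side effect of [[:punct:]] is that the class also contains the hyphen and the apostrophe, so hyphenated words and contractions get split as well. A small sketch, with a hypothetical helper split_punct and made-up sample strings:

```r
pattern <- "[[:space:]]*([[:punct:]]+)[[:space:]]*"

# Hypothetical helper wrapping the gsub call from the answer.
split_punct <- function(x) gsub(pattern, " \\1 ", x)

split_punct("stack-overflow")  # "stack - overflow"
split_punct("don't")           # "don ' t"
```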
Update for hyphenated words
To avoid matching hyphenated words, you can match and skip a - surrounded by word boundaries:
text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
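The (*SKIP)(*F) idiom can be read as "match this, then throw it away": when \\b-\\b matches, (*F) forces a failure and (*SKIP) makes the engine resume after the hyphen instead of trying the second alternative at the same position. The sketch below, with a hypothetical helper split_punct and made-up inputs, shows that a hyphen inside a word survives while a free-standing one is still treated as punctuation.

```r
pattern <- "\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*"

# Hypothetical helper; perl = TRUE is required for the backtracking verbs and \p{P}.
split_punct <- function(x) gsub(pattern, " \\1 ", x, perl = TRUE)

split_punct("stack-overflow - a site.")
# "stack-overflow - a site . "
split_punct("well-known, right?")
# "well-known , right ? "
```

Note that \\p{P} covers Unicode punctuation only, so it is not a drop-in equivalent of [[:punct:]]: characters such as + or $ are classified as symbols and are left alone by this pattern.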
After all these comments, this regex should best fit your needs:
(?:\b| )([.,:;!]+)(?: |\b)
To use it in R, the backslashes must be doubled.
So we end up with:
text<-c('Hi!stack-overflow... is a popular website, I visit it every day.',
'aaa...',
'AAA...B"B"B',
'AA .BBB #unlikely to happen but managed anyway')
> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "
[3] "AAA ... B\"B\"B"
[4] "AA . BBB #unlikely to happen but managed anyway"
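On the doubling rule: the regex engine only ever sees single backslashes, and the doubling is purely a matter of how R string literals are written. As a sketch, assuming R 4.0.0 or later, a raw string literal lets you paste the regex unchanged; the variable names below are made up for illustration.

```r
# Escaped form: each regex backslash is written twice in the R string literal.
pattern_escaped <- "(?:\\b| )([.,:;!]+)(?: |\\b)"

# Raw-string form (assumes R >= 4.0.0): backslashes are taken literally.
pattern_raw <- r"((?:\b| )([.,:;!]+)(?: |\b))"

identical(pattern_escaped, pattern_raw)  # TRUE
writeLines(pattern_escaped)              # prints the regex with single backslashes
nchar("\\b")                             # 2: one backslash and the letter b
```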