Removing punctuation except for apostrophes AND intra-word dashes with gsub in R WITHOUT accidentally concatenating two words

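I have been searching Stack Overflow for a solution and experimenting in R (RStudio) for hours. I know how to remove punctuation with gsub (NOT using the tm package) while retaining apostrophes, intra-word dashes and intra-word &'s (for AT&T), but I would like to know if anyone can offer tips on doing this while also solving the following issue. I would like to know how to keep gsub, or any other regex procedure, from concatenating words at the spots where the removed punctuation used to be. So far, this is the best I can do: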

x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"

gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", x, perl=TRUE)

#[1] "Good luckSPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventingconcatenating  is a newballgame butwhy not"

Any ideas? The ultimate goal is to apply the solution to a data frame column or a corpus of social media posts.

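You can:

  1. Match all the spaces before and after each run of punctuation, and use 1 space in the replacement
  2. Restrict [-'&] so that it only matches when preceded or followed by a non-word boundary \B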
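
As a small illustration of point 2 (a sketch only, not part of the original answer; the vector name `tokens` is made up here), `\B` lets the pattern tell a run of `-`, `'` or `&` at the edge of a word from one sitting between two word characters:

```r
# Hypothetical sample tokens, taken from the question's test string
tokens <- c("all-day", "stuff----", "At&t", "got&&&&&")

# TRUE where a run of - ' & touches a non-word boundary, i.e. it is not
# enclosed by word characters on both sides and would therefore be removed
is_stray <- grepl("\\B[-'&]+|[-'&]+\\B", tokens, perl = TRUE)
print(setNames(is_stray, tokens))
#   all-day stuff----      At&t  got&&&&&
#     FALSE      TRUE     FALSE      TRUE
```

The intra-word characters in "all-day" and "At&t" have a word boundary on both sides, so neither alternative can match them.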

Regex:

\s*(?:(?:\B[-'&]+|[-'&]+\B|[^-'&[:^punct:]]+)\s*)+
  • Notice that I am using a double negation in [^-'&[:^punct:]] to exclude -'& from the POSIX class [:punct:]
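
To see the double negation in isolation, here is a minimal sketch (the input string and variable names are invented for illustration). The negated class lists `-`, `'`, `&` and "everything that is not punctuation", so what remains is punctuation other than those three characters:

```r
sample_text <- "a-b'c&d!e?"   # hypothetical input

# Removes punctuation but leaves - ' & alone
kept_some <- gsub("[^-'&[:^punct:]]", "", sample_text, perl = TRUE)

# For comparison: plain [:punct:] removes all of it
kept_none <- gsub("[[:punct:]]", "", sample_text, perl = TRUE)

print(kept_some)  # "a-b'c&de"
print(kept_none)  # "abcde"
```

The `[:^punct:]` form relies on the PCRE engine, which is why `perl = TRUE` is needed here.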

Replacement:

" "   (1 space)

regex101 Demo

Code:

x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"

gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+", " ", x, perl=TRUE)

#[1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating  is a new ballgame but why not "

ideone Demo
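
Since the question mentions a data frame column, here is one possible way to apply the same pattern to a column. This is a sketch under assumptions: `strip_punct()` is a hypothetical helper name, the `posts` data frame is made-up example data, and the `trimws()` call is an extra step that the answer itself does not include.

```r
# strip_punct() is a hypothetical helper name; the pattern is the one from the answer
strip_punct <- function(text) {
  pattern <- "\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+"
  cleaned <- gsub(pattern, " ", text, perl = TRUE)
  # Extra step, not in the answer: drop the leading/trailing space left behind
  trimws(cleaned)
}

# Made-up example data standing in for a column of posts
posts <- data.frame(
  id = 1:3,
  text = c("Good luck!!!!SPRINT",
           "brand-new stuff---- excites me",
           "At&t why? I can't lie."),
  stringsAsFactors = FALSE
)

# gsub() is vectorised, so the whole column is cleaned in one call
posts$clean <- strip_punct(posts$text)
print(posts$clean)
# [1] "Good luck SPRINT"           "brand-new stuff excites me"
# [3] "At&t why I can't lie"
```

Note that runs of spaces already present in the input (with no punctuation next to them) are left untouched by this pattern.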

You can do this with a single function call, leaving only leading/trailing spaces:

gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "

If you are able to use the qdapRegex package, you could do:

library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"