Remove all punctuation from text including apostrophes for tm package

I have a vector of tweets (message text only) that I am cleaning for text mining. I used removePunctuation from the tm package like this:

clean_tweet_text = removePunctuation(tweet_text)

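This gives a vector with all punctuation removed except apostrophes, which breaks my keyword searches because words that touch an apostrophe are not registered. For example, one of my keywords is climate, but a tweet containing 'climate is not counted.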

How can I remove all apostrophes/single quotes from my vector?

Here is a reproducible example, the head of the vector from dput:

c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap", 
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…", 
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…", 
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl", 
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck", 
"unusual warming kills gulf of maine cod  discovery news globalwarming  httpstco39uvock3xe", 
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", 
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
)

To remove all punctuation (including apostrophes and single quotes), you can just use gsub():

x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
       "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
       "rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
       "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
       "ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
       "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
       "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
       "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")

gsub("[[:punct:]]", "", x)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"

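gsub() replaces every occurrence of its first argument with its second argument within its third argument (see help("gsub")). Here, that means every character in the vector x that belongs to the set [[:punct:]] is replaced with "" (that is, removed).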
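As a minimal sketch of that argument order, the call below runs on a single made-up string (the variable names tweet and cleaned are illustrative, not from the question). It uses only ASCII punctuation, so the result should not depend on the locale:

```r
# Hypothetical one-element input containing only ASCII punctuation
tweet <- "don't panic, 'climate' change is real!"

# Arguments in order: pattern, replacement, input.
# Every character matching the pattern is replaced by the empty string.
cleaned <- gsub("[[:punct:]]", "", tweet)
cleaned
#> [1] "dont panic climate change is real"
```

Because gsub() is vectorised over its third argument, the same call works unchanged on a whole character vector such as x.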

Which characters are removed? From help("regex"):

[:punct:]

    Punctuation characters:
    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
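The list above covers only ASCII characters. If the goal is to match punctuation by Unicode category regardless of locale, one option is the Perl-style property class \p{P} with perl = TRUE. The sketch below uses a made-up string (s, no_punct and no_punct_sym are illustrative names) and assumes R was built with PCRE Unicode property support, which is the usual case:

```r
# Hypothetical sample; \u escapes keep the script itself pure ASCII.
# \u2018 and \u2019 are curly single quotes, \u2026 is the ellipsis character.
s <- "ted cruz \u2018climate change\u2019 it\u2019s religion\u2026 $5 + tax"

# \p{P} matches any Unicode punctuation character (requires perl = TRUE)
no_punct <- gsub("\\p{P}", "", s, perl = TRUE)
no_punct
#> [1] "ted cruz climate change its religion $5 + tax"

# $ and + are Unicode symbols, not punctuation; add \p{S} to drop them too
no_punct_sym <- gsub("[\\p{P}\\p{S}]", "", s, perl = TRUE)
```

Note that \p{P} alone leaves characters such as $ and + in place, whereas the ASCII [:punct:] list includes them, so the two patterns are not interchangeable.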

Update

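It appears this was happening because your apostrophes are typographic (curly) quotes such as ’ rather than the ASCII apostrophe '. So, if you want to stick with tm::removePunctuation(), you can also use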

tm::removePunctuation(x, ucp = TRUE)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
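To check whether a string contains such non-ASCII quote characters in the first place, one possible diagnostic is to list the code points above 127. This is only a sketch on a made-up string (s, code_points, non_ascii and labels are illustrative names), using base R's utf8ToInt(), which expects a single string:

```r
# Hypothetical sample with one ASCII apostrophe and one curly apostrophe
s <- "it's vs it\u2019s"

# One integer code point per character of the string
code_points <- utf8ToInt(s)

# Keep the distinct code points outside the ASCII range
non_ascii <- unique(code_points[code_points > 127])

# Format them in the usual U+XXXX notation
labels <- sprintf("U+%04X", non_ascii)
labels
#> [1] "U+2019"
```

For a whole vector, the same steps could be applied to each element in turn, for example with lapply().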