Generic way to avoid special characters in R

Question

Below is a list of email subjects stored in a data.frame, DF. Note that I imported it from an Excel sheet.

    EmailSubject
    Buy the stunning new phone
    The game changer is here.
    Experience a phone ahead of its time.
    Thank You Chennai
    Limited Period offer
    Valentines day special
    Buy a phone at 10000 and get a new sim free
    Limited Period offer
    Valentines day special
    Buy a phone at 10000 and get a new sim free
    Buy the stunning new phone
    The game changer is here.
    Experience a phone ahead of its time.
    Thank You Chennai
    Limited Period offer
    Valentines day special
    Buy a phone at 10000 and get a new sim free
    Thank You Chennai
    Limited Period offer
    Valentines day special
    Buy a phone at 10000 and get a new sim free
    Buy a phone at 10000 and get a new sim free
    Buy the stunning new phone
    The game changer is here.

I created a term-document matrix in R using the following code:

    library(tm)
    mytext <- DF$EmailSubject
    mycorpus <- Corpus(VectorSource(mytext))
    mycorpus <- tm_map(mycorpus, removePunctuation)
    mycorpus <- tm_map(mycorpus, removeNumbers)
    # Wrap base functions in content_transformer() so the corpus structure is kept
    mycorpus <- tm_map(mycorpus, content_transformer(tolower))
    mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))

    # Create a term-document matrix
    dtm <- TermDocumentMatrix(mycorpus)
    m <- as.matrix(dtm)
    # Total frequency of each term across all documents, most frequent first
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    head(d, 10)

This produces the following output:

             word freq
              get   45
             free   44
             edge   35

              new   29
              buy   24
          charger   23
         wireless   23
             just   21
            month   21
              per   21
         starting   21
         stunning   21
              pro   20
              now   17
           offers   17
             gear   16
        exclusive   15
            offer   14
             gift   13

     irresistible   10
            loved   10
    valentineâ€™s   10

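I do get the term-document matrix. However, some words show up with special characters only in the term-document matrix; those characters are not present in the original data frame. I tried adjusting the encoding, and I removed the characters manually with gsub. Is there a way to keep the words from my Excel sheet from being turned into special characters?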

gsub("€™", "", d$word)

This is a manual approach. Is there an automatic way? The encoding is UTF-8. Is there a package that lets us avoid this error?
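For readers who hit the same symptom: a sequence like "â€™" is what the UTF-8 bytes of a curly apostrophe (U+2019) look like when they are read as Windows-1252. The following base R sketch reproduces the garbling and reverses it. The sample word and the variable names are made up for illustration, and whether this is what happened to the Excel import is an assumption, not something the post confirms.

```r
# Hypothetical sample word containing a curly apostrophe (U+2019), stored as UTF-8
original <- "valentine\u2019s"

# Misread the UTF-8 bytes as Windows-1252 (CP1252); this reproduces the garbling
garbled <- iconv(original, from = "CP1252", to = "UTF-8")
print(garbled)

# Reverse the misreading: write the characters back out as CP1252 bytes,
# then declare those bytes to be UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"
print(identical(repaired, original))
```

If the round trip restores the text, the data was most likely decoded with the wrong encoding at import time, and fixing the import is cleaner than patching individual words with gsub.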

Answer 1

This should help you:

Encoding(x) <- "UTF-8"

iconv(dtm, "UTF-8", "ASCII", sub="")
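As a minimal sketch of how these two calls might be used, the example below applies them to a character vector of subjects before any corpus is built. Applying them to the text rather than to the matrix object is an assumption about the intent. The sample strings and variable names are hypothetical. Replacing the curly apostrophe first is an optional extra step, not part of the answer above.

```r
# Hypothetical sample subjects; the first contains a curly apostrophe (U+2019)
subjects <- c("Valentine\u2019s day special", "Limited Period offer")

# Declare the strings as UTF-8 (only valid if the bytes really are UTF-8)
Encoding(subjects) <- "UTF-8"

# Option A: drop every character that has no ASCII equivalent
ascii_drop <- iconv(subjects, from = "UTF-8", to = "ASCII", sub = "")

# Option B: map the curly apostrophe to a plain one first, then drop the rest
normalized <- gsub("\u2019", "'", subjects, fixed = TRUE)
ascii_norm <- iconv(normalized, from = "UTF-8", to = "ASCII", sub = "")

print(ascii_drop)
print(ascii_norm)
```

The cleaned vector can then be passed to VectorSource() in place of the raw column. Note that Option A silently deletes accented letters as well, so it only suits text that is expected to be plain English.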
