Generic way to avoid special characters in R
Below is a list of email subjects stored in a data.frame named DF. Note that I imported it from an Excel sheet.
EmailSubject
Buy the stunning new phone
The game changer is here.
Experience a phone ahead of its time.
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Buy the stunning new phone
The game changer is here.
Experience a phone ahead of its time.
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Buy a phone at 10000 and get a new sim free
Buy the stunning new phone
The game changer is here.
I created a term-document matrix in R using the following code:
library(tm)
mytext <- DF$EmailSubject
mycorpus <- Corpus(VectorSource(mytext))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
# Wrap base functions such as tolower in content_transformer() so the corpus structure is preserved
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Create a term-document matrix
dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
# Total frequency of each term across all documents, highest first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
This produces the following term frequencies:
word freq
get 45
free 44
edge 35
new 29
buy 24
charger 23
wireless 23
just 21
month 21
per 21
starting 21
stunning 21
pro 20
now 17
offers 17
gear 16
exclusive 15
offer 14
gift 13
irresistible 10
loved 10
valentine’s 10
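I do get the term-document matrix. However, some words show up with special characters only in the term-document matrix; those characters are not present in the original data frame. I have tried adjusting the encoding, and I have removed the stray characters manually with gsub. Is there a way to keep the words from my Excel sheet from being turned into special characters?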
gsub("€™", "", d$word)
That is a manual approach. Is there an automatic one? The encoding is UTF-8. Is there a package that lets us avoid this problem?
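This should help. Here x is the character vector of subjects (for example DF$EmailSubject). Convert it before building the corpus, because iconv works on character vectors rather than on a term-document matrix:
Encoding(x) <- "UTF-8"
x <- iconv(x, "UTF-8", "ASCII", sub = "")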
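To make the conversion step concrete, here is a minimal base R sketch. It assumes the stray "’" comes from a curly apostrophe (U+2019) whose UTF-8 bytes were displayed as Windows-1252, which is what that byte sequence looks like. The sample vector and the names subjects, has_non_ascii, dropped, normalized, and cleaned are hypothetical and stand in for DF$EmailSubject.

```r
# Hypothetical sample standing in for DF$EmailSubject. The second subject
# contains a curly apostrophe (U+2019), written as an escape so this script
# itself stays pure ASCII.
subjects <- c("Buy the stunning new phone",
              "Valentine\u2019s day special")

# Flag the entries that contain any non-ASCII character.
has_non_ascii <- grepl("[^\\x01-\\x7F]", subjects, perl = TRUE)

# Option 1: drop every byte that has no ASCII equivalent.
# The apostrophe disappears altogether.
dropped <- iconv(subjects, from = "UTF-8", to = "ASCII", sub = "")

# Option 2: first map the curly apostrophe to a plain one, then drop
# whatever non-ASCII bytes remain. The apostrophe survives as "'".
normalized <- gsub("\u2019", "'", subjects, fixed = TRUE)
cleaned <- iconv(normalized, from = "UTF-8", to = "ASCII", sub = "")

print(data.frame(subjects, has_non_ascii, dropped, cleaned))
```

Either cleaned vector can then be passed to VectorSource() in place of the raw column. Note that removePunctuation will still strip the plain apostrophe in Option 2, so both options should end up as the same term once the corpus is processed; the difference only matters if the text is used elsewhere.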