How to remove non-UTF-8 characters from text
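I need help removing non-UTF-8 characters from my word cloud. This is my code so far. I have tried gsub and removeWords, but the characters still show up in my word cloud, and I don't know what to do to get rid of them. Any help would be appreciated. Thank you for your time.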
txt <- readLines("11-0.txt")
corpus = VCorpus(VectorSource(txt))
gsub("’","‘","",txt)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))
tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
Edit: here is my iconv version
txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")
corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))
tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)
The signature of gsub is:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
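I'm not sure what you are trying to do with
gsub("’","‘","",txt)
but that line probably does not do what you intend: by position, "’" is the pattern, "‘" is the replacement, the empty string "" is taken as x, and txt ends up in ignore.case. The result is also never assigned, so txt would stay unchanged either way.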
For a previous SO question on gsub and non-ASCII symbols, see here.
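To make the argument matching concrete, here is a small sketch in base R. It first uses match.call() to show how the four positional arguments of the original call line up with the signature, then shows one possible way to strip both curly quotes with a single gsub call. The variable names sample_txt and cleaned and the sample strings are made up for illustration; the curly quotes are written as \u escapes only so the snippet does not depend on the file encoding.

```r
# How R matches the four positional arguments of the original call.
# match.call() only inspects the call; nothing is evaluated, so txt
# does not need to exist here.
mc <- match.call(gsub, quote(gsub("\u2019", "\u2018", "", txt)))
print(mc)

# One possible fix (an assumption about the intent: delete both curly quotes).
# A character class holds both quotes, and the result is assigned.
sample_txt <- c("\u2018Alice\u2019 said", "plain text")  # hypothetical sample input
cleaned <- gsub("[\u2018\u2019]", "", sample_txt)
print(cleaned)
```

In the original script the same idea would mean assigning the result back to txt before VCorpus(VectorSource(txt)) is called, since the corpus is built from whatever txt holds at that point.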
Edit:
Suggested solution using iconv:
Remove all non-ASCII characters:
txt <- "’xxx‘"
iconv(txt, "latin1", "ASCII", sub="")
Returns:
[1] "xxx"
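Note that iconv returns a new character vector and does not modify its input, so in a script the result has to be assigned before the corpus is built. A minimal sketch of that step, using the same iconv call as above on a made-up two-line sample (sample_txt and ascii_txt are hypothetical names standing in for the lines read from the file):

```r
# Hypothetical sample standing in for the result of readLines("11-0.txt")
sample_txt <- c("\u2019xxx\u2018", "Alice was beginning")

# Same call as above, but the cleaned vector is kept
ascii_txt <- iconv(sample_txt, "latin1", "ASCII", sub = "")
print(ascii_txt)
```

The cleaned vector, not the original one, would then be passed to VectorSource when creating the corpus.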