How to remove non UTF-8 characters from text

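I need help removing non UTF-8 characters from my word cloud. This is my code so far. I have tried gsub and removeWords, but the characters still show up in my word cloud, and I don't know what to do to get rid of them. Any help would be appreciated. Thank you for your time.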

txt <- readLines("11-0.txt")
corpus = VCorpus(VectorSource(txt))
gsub("’","‘","",txt)

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))

Edit: here is my iconv version

txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")

corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

I'm not sure what you are trying to do with

gsub("’","‘","",txt)

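but that line probably doesn't do what you want. With four positional arguments, "’" is taken as pattern, "‘" as replacement, the empty string "" as x, and txt as ignore.case, so txt is never searched. The result is also not assigned to anything.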

For a previous SO question on gsub and non-ASCII symbols, see here
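
If the goal was to strip both curly quotes from txt, a minimal sketch might look like the following. It assumes a character class is acceptable for your case and uses a made-up sample vector in place of the file contents. The quotes are written as Unicode escapes so the script does not depend on the encoding of the source file.

```r
# Hypothetical sample standing in for readLines("11-0.txt")
txt <- c("\u2018Alice\u2019 said", "plain text")

# pattern = character class holding both curly quotes,
# replacement = "" (delete them), x = txt (the vector to search).
# gsub returns a new vector, so the result has to be assigned.
cleaned <- gsub("[\u2018\u2019]", "", txt)

print(cleaned)
```

The cleaned vector, not the original txt, is what would then be passed on to VectorSource.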

Edit:

A suggested solution using iconv:

Remove all non-ASCII characters:

txt <- "’xxx‘"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"
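
Note that iconv, like gsub, returns a new vector and leaves txt itself untouched, so the result has to be assigned before the corpus is built. A small sketch of that, again with a made-up sample vector instead of the real file (the name cleaned is only for illustration):

```r
# Hypothetical sample standing in for readLines("11-0.txt")
txt <- c("\u2019xxx\u2018", "plain text")

# Same call as above, but the result is stored.
# Bytes that cannot be represented in ASCII are replaced by sub = "".
cleaned <- iconv(txt, "latin1", "ASCII", sub = "")

print(cleaned)

# txt is unchanged by the call; only cleaned holds the ASCII-only text
print(identical(txt, cleaned))
```

In the code from the question, this would mean writing txt <- iconv(txt, "latin1", "ASCII", sub="") before the VCorpus line. Exact iconv behaviour can vary somewhat between platforms, so it is worth checking the output on your own machine.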