UTF-8 Character Encoding with TermDocumentMatrix

Question

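I am trying to learn R, and I have been stuck on this problem for hours. I have searched and tried many things, without success so far. Here is the situation: I am downloading some random tweets from Twitter (via twitteR). When I inspect my data frame, I can see all the special characters (such as üğıİşçÇöÖ). I then remove some things (whitespace and so on). After cleaning and manipulating my corpus, everything still looks fine. The character encoding problem starts when I try to create a TermDocumentMatrix. After that, "tdm" and "df" contain strange symbols and may have lost some characters. Here is the code: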

tweetsg.df <- twListToDF(tweets)
#looks good. no encoding problems.
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus, control = list(tokenize="scan", 
wordLengths = c(3, Inf),language="Turkish"))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

At this point both tdm and df contain strange symbols and missing characters.

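What I have tried so far:

Tried different tokenizers, including a custom one.
Changed the locale to my own language with Sys.setlocale.
Used enc2utf8.
Changed the display language of my system (Windows 10) to my own language.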

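Still no luck! Any kind of help or pointers would be appreciated :) PS: I am not a native English speaker and I am new to R. Also, once this is solved, I expect to run into the same problem with emoji. I would like to remove them or, better, make use of them :)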

Answer 1

I managed to reproduce your problem and made changes that produce correct Turkish output. Try changing the line

wordCorpus <- Corpus(VectorSource(tweetsg.df$text))

to

wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))

and add a line similar to this one:

Encoding(tweetsg.df$text)  <- "UTF-8"
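To illustrate why setting the declared encoding can matter where enc2utf8 did not help, here is a minimal base-R sketch. The sample word and the variable names are made up for illustration, and it assumes the underlying bytes are already valid UTF-8 and only the encoding mark is wrong; whether that is the case for the tweets in the question is not confirmed.

```r
# A Turkish word written with \u escapes, so R marks it as UTF-8.
word <- "de\u011fi\u015ftirdik"

# Simulate a string whose bytes are UTF-8 but whose declared encoding is wrong.
mislabelled <- word
Encoding(mislabelled) <- "latin1"

# enc2utf8() trusts the declared encoding, so it re-encodes every byte
# as if it were a separate latin1 character (double encoding).
double_encoded <- enc2utf8(mislabelled)

# Encoding<- only changes the declaration; the bytes stay untouched.
remarked <- mislabelled
Encoding(remarked) <- "UTF-8"

print(Encoding(word))
print(nchar(word))            # characters in the correctly marked string
print(nchar(double_encoded))  # longer, because of the double encoding
print(identical(charToRaw(remarked), charToRaw(word)))
```

If the strings in a real data set are not valid UTF-8 to begin with, re-marking them will not repair them; validUTF8() can be used to check this first.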

The code I got working is:

library(tm)
sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish)  <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

print(findFreqTerms(tdm, lowfreq=2))

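This only works when the file is run with the source command from the console; clicking Run or the Source button in RStudio does not work. I also made sure to select "Save with Encoding" and "UTF-8" (although that may only have been necessary because the file contains Turkish text).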

> source("Turkish.R")
[1] "değiştirdik"
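Since the result depended on how the file was read, it may be worth passing the file encoding to source explicitly through its encoding argument. The sketch below is self-contained: it writes a small UTF-8 script to a temporary file (the file and the variable word are hypothetical stand-ins for Turkish.R) and then sources it. It has not been checked against the RStudio buttons mentioned above.

```r
# Write a one-line script to a temporary file as raw UTF-8 bytes.
script <- tempfile(fileext = ".R")
code <- enc2utf8("word <- \"de\u011fi\u015ftirdik\"")
writeLines(code, script, useBytes = TRUE)

# Source the script into its own environment, declaring the file encoding.
env <- new.env()
source(script, local = env, encoding = "UTF-8")

print(env$word)
print(nchar(env$word))

# Clean up the temporary file.
unlink(script)
```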

The second answer to "R tm package: utf-8 text" was what ultimately helped.

Answer 2

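I had a vector of UTF-8 encoded strings from a PostgreSQL database that raised the same error, and none of the suggested solutions worked (see below for details). My solution was simply to convert from UTF-8 to latin1 with the iconv function. After that I could create the corpus with the normal VectorSource function.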

# text: loaded from PostgreSQL in UTF-8
# convert to latin1
text <- iconv(text, "UTF-8", "latin1")

wordCorpus <- Corpus(VectorSource(text))

Maybe this helps someone else.
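One caveat with this conversion: latin1 (ISO-8859-1) has no code points for several Turkish letters such as ğ, ş, ı and İ, and iconv returns NA for elements it cannot convert unless a sub argument is supplied. A minimal check, using made-up sample strings, might look like this:

```r
# Sample input: the first element fits in latin1, the second does not.
text <- c("caf\u00e9", "de\u011fi\u015ftirdik")

# Elements that cannot be represented in latin1 come back as NA.
converted <- iconv(text, from = "UTF-8", to = "latin1")
lost <- is.na(converted)
print(text[lost])

# Alternative: keep every element and replace unconvertible bytes with "?".
marked <- iconv(text, from = "UTF-8", to = "latin1", sub = "?")
print(anyNA(marked))
```

Checking for NA before building the corpus makes it visible when the conversion is dropping documents, which is likely for Turkish text like that in the question.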

Solutions that did not work for me: first I followed Jeremy's answer, changing VectorSource to DataframeSource and setting the encoding to UTF-8, but then I got a new error:

Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

I found this thread (Error faced while using TM package's VCorpus in R), but the answer given there, which manually creates the data.frame for newer versions of the tm package, did not help either.
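For context, the error message above comes from a check that the data frame passed to DataframeSource has columns named doc_id and text, which recent versions of tm require. The sketch below shows a data frame of that shape; the sample strings and variable names are illustrative, and the corpus is only built if tm is installed. It addresses the column check only, and as noted above it did not resolve the encoding problem in this case.

```r
# Illustrative UTF-8 sample documents.
text <- c("de\u011fi\u015ftirdik", "\u00e7ok g\u00fczel")

# DataframeSource expects the columns doc_id and text, in that order.
docs <- data.frame(
  doc_id = as.character(seq_along(text)),
  text = text,
  stringsAsFactors = FALSE
)

# Build the corpus only when the tm package is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::DataframeSource(docs))
  print(length(corpus))
}
```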
