How to properly encode UTF-8 txt files for R topic model

Question

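Similar questions have been discussed on this forum (e.g. here and here), but I have not found one that solves my problem, so I apologize for asking what looks like a duplicate.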

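I have a set of .txt files encoded in UTF-8 (see the screenshot). I am trying to run a topic model in R using the tm package. However, despite using encoding = "UTF-8" when creating the corpus, I get obvious encoding problems. For instance, I get <U+FB01>scal instead of fiscal and in<U+FB02>uenc instead of influence. In addition, not all punctuation is removed and some characters are not recognized: quotation marks survive in some cases, such as view” or plan’, and I see terms like ændring, orphaned quotation marks like “ and ”, zit, or years—thus (with a dash that should have been removed). These terms also show up in the topic distribution over terms. I had some encoding problems before, but creating the corpus with encoding = "UTF-8" used to solve them. This time it does not seem to help.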

I am on Windows 10 x64, R version 3.6.0 (2019-04-26), and version 0.7-7 of the tm package (all up to date). I would greatly appreciate any advice on how to address the problem.

library(tm)
library(beepr)
library(ggplot2)
library(topicmodels)
library(wordcloud)
library(reshape2)
library(dplyr)
library(tidytext)
library(scales)
library(ggthemes)
library(ggrepel)
library(tidyr)


inputdir<-"c:/txtfiles/"
docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))

#Preprocessing
docs <-tm_map(docs,content_transformer(tolower))

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
docs <- tm_map(docs, content_transformer(removeURL))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
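docs <- tm_map(docs, toSpace, "\\.")  # backslash doubled so R can parse the string; matches a literal dot
# "-" is already replaced by the toSpace call above, so no second pass is needed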


docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs,stemDocument)

dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing=TRUE)
write.csv(freq[ord],file=paste("word_freq.csv"))

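#Topic model
# k, nstart, seed, best, burnin, iter and thin must be defined beforehand (not shown here)
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))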

Edit: I should add, in case it turns out to be relevant, that the txt files were created from PDFs using the following R code:

inputdir <-"c:/pdf/"
myfiles <- list.files(path = inputdir, pattern = "pdf",  full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE) )

Two sample txt files can be downloaded here.
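To narrow down which characters are causing trouble, it can help to list the non-ASCII code points that actually occur in the text. The following is a minimal sketch, assuming the stringi package is installed; the two strings in `txt` are made-up stand-ins for file contents, built from the tokens mentioned above.

```r
library(stringi)

# Hypothetical stand-ins for lines read from the txt files
txt <- c("the \ufb01scal plan\u2019", "in\ufb02uence over years\u2014thus")

# Extract every character outside the ASCII range
chars <- unlist(stri_extract_all_regex(txt, "[^\\x00-\\x7F]",
                                       omit_no_match = TRUE))
chars <- unique(chars)

# Convert each character to its Unicode code point, formatted as U+XXXX
codes <- sprintf("U+%04X", unlist(stri_enc_toutf32(chars)))
print(codes)
```

If code points such as U+FB01 show up here, the files are being read as valid UTF-8 and the ligature is a real character in the text, so the issue lies in the content rather than in the encoding argument.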

Answer 1

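I found a workaround that seems to work correctly on the 2 example files you supplied. The first thing you need to do is apply NFKD (Compatibility Decomposition). This splits the "ﬁ" orthographic ligature (U+FB01) into f and i. Luckily, the stringi package can handle this. So before doing any of the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step just after (or before) the tolower step.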

Please do read the documentation for this function and its references.
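As a quick check of what the normalization does, here is a minimal sketch on two tokens from the question, assuming stringi is installed. The input strings are constructed by hand with Unicode escapes and are not taken from the sample files.

```r
library(stringi)

# Tokens containing the ligatures U+FB01 (fi) and U+FB02 (fl)
broken <- c("\ufb01scal", "in\ufb02uence")

# Compatibility decomposition replaces each ligature with its plain letters
fixed <- stri_trans_nfkd(broken)
print(fixed)  # "fiscal" "influence"
```

Note that NFKD also decomposes accented letters into a base letter plus a combining mark. If that interferes with later steps, stringi::stri_trans_nfkc, which recomposes such letters after splitting the ligatures, may be worth trying instead.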

library(tm)
docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))

#Preprocessing
docs <-tm_map(docs,content_transformer(tolower))

# use stringi to fix all the orthographic ligature issues 
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))

# add following line as well to remove special quotes. 
# this uses a replace from textclean to replace the weird quotes 
# which later get removed with removePunctuation
docs <- tm_map(docs, content_transformer(textclean::replace_curly_quote))

# ... rest of process ...
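If adding textclean is not an option, the leftover curly quotes and dashes could also be handled with a Unicode-aware pattern. The sketch below is an alternative to the replace_curly_quote step, not part of the workaround above, and assumes stringi is installed; the example tokens come from the question.

```r
library(stringi)

# Tokens with curly quotes (U+201C, U+201D, U+2019) and an em dash (U+2014)
x <- c("view\u201d", "plan\u2019", "\u201cand\u201d", "years\u2014thus")

# \p{P} matches any Unicode punctuation character, not only ASCII punctuation
no_punct <- stri_replace_all_regex(x, "\\p{P}", " ")

# Collapse repeated whitespace and trim the ends
cleaned <- stri_trim_both(stri_replace_all_regex(no_punct, "\\s+", " "))
print(cleaned)  # "view" "plan" "and" "years thus"

# Inside a tm pipeline this could be wrapped as (untested on the sample files):
# docs <- tm_map(docs, content_transformer(
#   function(x) stringi::stri_replace_all_regex(x, "\\p{P}", " ")))
```

tm's removePunctuation also documents a ucp argument for using Unicode character properties; whether it is available depends on the installed tm version, so check ?removePunctuation before relying on it.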
