Remove stopwords and tolower function slow on a Corpus in R

Question

I have a corpus of about 75 MB of data. I am trying to use the following commands:

tm_map(doc.corpus, removeWords, stopwords("english"))
tm_map(doc.corpus, tolower)

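These two functions alone take at least 40 minutes to run. I am looking for a way to speed this up, since I use a tdm (term-document matrix) for my model.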

I have tried commands like gc() and memory.limit(10000000) quite often, but I have not been able to speed up the processing.

My system has 4 GB of RAM and runs a local database from which the input data is read.

Any suggestions for speeding this up would be appreciated!

Answer 1

First, I would try

tm_map(doc.corpus, content_transformer(tolower))

because tolower() is not in the list returned by getTransformations().
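To make the suggestion concrete, here is a minimal sketch on a tiny made-up corpus (the two sample sentences and the `cleaned` variable are illustrative, not from the question). It assumes a tm version of 0.6 or later, where base functions such as tolower() need to be wrapped in content_transformer() so that the corpus keeps holding text documents. Two further points that may matter: tm_map() returns a new corpus, so the result has to be assigned, and lowercasing before removeWords lets capitalised stopwords such as "The" match the lowercase stopword list.

```r
library(tm)

# Tiny in-memory corpus standing in for the real 75 MB data (hypothetical).
docs <- c("The Quick Brown Fox", "Jumps OVER the lazy dog")
doc.corpus <- VCorpus(VectorSource(docs))

# Assign the result of every tm_map() call; the corpus is not modified in place.
# Lowercase first so that "The" and "OVER" match the lowercase stopword list.
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
doc.corpus <- tm_map(doc.corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector for inspection.
cleaned <- vapply(content(doc.corpus), as.character, character(1))
print(cleaned)
```

This only addresses correctness of the pipeline; whether it also shortens the 40-minute run time on the full corpus would have to be measured.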

Answer 2

Maybe you can give quanteda a try:

library(stringi)
library(tm)
library(quanteda)

txt <- stri_rand_lipsum(100000L)
print(object.size(txt), units = "Mb")
# 63.4 Mb

system.time(
  dfm <- dfm(txt, toLower = TRUE, ignoredFeatures = stopwords("en")) 
)
# Elapsed time: 12.3 seconds.
#        user      system     elapsed 
#       11.61        0.36       12.30 

system.time(
  dtm <- DocumentTermMatrix(
    Corpus(VectorSource(txt)), 
    control = list(tolower = TRUE, stopwords = stopwords("en"))
  )
)
#   user      system     elapsed 
# 157.16        0.38      158.69
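The dfm() call in the benchmark uses argument names (toLower, ignoredFeatures) from an older quanteda release. As a hedged sketch, assuming a recent quanteda (version 3 or later, where to my knowledge dfm() expects a tokens object), the same lowercase-and-remove-stopwords step could look like the following. The sample texts and the name `dfm_clean` are illustrative, and no timings are claimed for this variant.

```r
library(quanteda)

# Two short sample documents (hypothetical stand-ins for the real corpus).
txt <- c(d1 = "The Quick Brown Fox", d2 = "Jumps over the lazy dog")

# Tokenise, lowercase, and drop English stopwords as separate, explicit steps.
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, quanteda::stopwords("en"))

# Build the document-feature matrix from the cleaned tokens.
dfm_clean <- dfm(toks)
print(featnames(dfm_clean))
```

Calling quanteda::stopwords() explicitly avoids ambiguity when tm is loaded in the same session, since both packages provide a function of that name.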


performance

r

text-mining

tm