Apply dfm-style preprocessing to an input text column without creating the dfm

Question

I have a data frame like this:

dataf <- data.frame(id = c(1,2,3,4), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s","Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now","There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour","a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum"))

The text-analysis preprocessing can be done while constructing a dfm:

# Assumes myCorpus (e.g. corpus(dataf, text_field = "text")) and mystopwords already exist
myDfm <- myCorpus %>%
     tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
     tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
     tokens_wordstem() %>%
     dfm(verbose = FALSE) %>%
     dfm_trim(min_docfreq = 3, min_termfreq = 5)

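Is there an alternative way to remove the stopwords (stopwords(source = "smart")), stem the words, and apply the trimming (min_docfreq = 3, min_termfreq = 5) directly on the text column, without creating the dfm?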

Answer 1

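I'll answer this based on both the question and the comments, since it seems that what you actually need is an object of class dgCMatrix. (That is what textmineR::CreateDtm() returns.) Fortunately, a quanteda dfm is already a special type of dgCMatrix. So it will probably work as is, but if you prefer, it is also easy to convert: just use as().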

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
data(nih_sample, package = "textmineR")

dfmat <- nih_sample %>%
  corpus(text_field = "ABSTRACT_TEXT", docid_field = "APPLICATION_ID") %>%
  tokens() %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()
dtm2 <- as(dfmat, "dgCMatrix")

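Now dtm2 should work just like the dtm in the blog post. (The features/columns are in a different order, but that should not matter for a matrix that will be fed into a topic model.) Also: this is a very clean process.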
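As a quick sanity check of the claim that a dfm already is a dgCMatrix, the sketch below builds a tiny dfm and inspects its class before and after conversion. The two toy documents and the names txt, dfmat_toy and dtm_toy are invented for illustration; they are not from the question or the blog post.

```r
library(quanteda)

# Toy documents, invented for illustration only
txt <- c(d1 = "topic models need a document term matrix",
         d2 = "a dfm is a sparse document feature matrix")

dfmat_toy <- dfm(tokens(txt))

# Check whether the dfm class extends dgCMatrix
is(dfmat_toy, "dgCMatrix")

# Explicit conversion to a plain dgCMatrix, as in the answer above
dtm_toy <- as(dfmat_toy, "dgCMatrix")
class(dtm_toy)

# The dimensions should be unchanged by the conversion
all(dim(dtm_toy) == dim(dfmat_toy))
```

If a downstream function checks the class strictly rather than by inheritance, passing the converted object is probably the safer choice.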

You can freely plug in additional tokens() options, dfm_trim(), and so on from quanteda as needed.
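For example, the preprocessing steps from the question could be slotted into the same pipeline before converting. This is only a sketch under stated assumptions: the texts are shortened stand-ins for the question's data, mystopwords is a hypothetical custom list, and the trimming thresholds are lowered from the question's 3 and 5 because those values would probably remove every feature from a four-document toy corpus.

```r
library(quanteda)

# Shortened stand-in for the question's data frame (texts abbreviated)
dataf <- data.frame(
  id = 1:4,
  text = c("Lorem Ipsum is simply dummy text of the printing industry.",
           "Content here, making it look like readable English text.",
           "There are many variations of passages of Lorem Ipsum available.",
           "A Latin professor looked up one of the more obscure Latin words."),
  stringsAsFactors = FALSE
)

# Hypothetical custom stopword list; the question does not show its contents
mystopwords <- c("lorem", "ipsum")

# Tokenise, then remove stopwords and stem, all at the tokens level
toks <- tokens(corpus(dataf, text_field = "text"),
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = c(stopwords(source = "smart"), mystopwords))
toks <- tokens_wordstem(toks)

# Trim with thresholds lowered for the toy data (the question uses 3 and 5)
dfmat <- dfm_trim(dfm(toks), min_docfreq = 2, min_termfreq = 2)

# Convert to the class that textmineR-style functions expect
dtm <- as(dfmat, "dgCMatrix")
dim(dtm)
```

The trimming step still goes through a dfm, because document and term frequencies are properties of the whole matrix rather than of a single text.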


r

quanteda