How to remove zero entries in a DFM when the matrix is too big for usual manipulation?

I have the following problem: I convert a corpus into a dfm, and this dfm contains some zero entries (documents with no tokens) that I need to remove before fitting an LDA model. I usually do it like this:

OutDfm <- dfm_trim(
  dfm(corpus,
      tolower = TRUE,
      remove = c(stopwords("english"), stopwords("german"),
                 stopwords("french"), stopwords("italian")),
      remove_punct = TRUE,
      remove_numbers = TRUE,
      remove_separators = TRUE,
      stem = TRUE,
      verbose = TRUE),
  min_docfreq = 5)

Creating a dfm from a corpus input...
   ... lowercasing
   ... found 272,912 documents, 112,588 features
   ... removed 613 features
   ... stemming features (English), trimmed 27,491 feature variants
   ... created a 272,912 x 84,515 sparse dfm
   ... complete. 
Elapsed time: 78.7 seconds.


# remove zero-entries
raw.sum <- apply(OutDfm, 1, FUN = sum)
which(raw.sum == 0)
OutDfm <- OutDfm[raw.sum != 0, ]

However, when I try to execute the last operation, I get: Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105, which suggests the matrix is too large to manipulate.

Has anyone encountered and solved this problem? Any alternative strategies for removing the zero entries?

Thanks a lot!

Your apply(OutDfm, 1, sum) call converts the dfm from a sparse to a dense matrix in order to compute the row sums, which is what triggers the Cholmod error.

Either use slam::row_sums, since the slam functions work directly on sparse matrices, or, better still, simply use quanteda::dfm_subset to select all documents that contain more than zero tokens:

dfm_subset(OutDfm, ntoken(OutDfm) > 0)
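For reference, the row-sum route can also be done without ever densifying the matrix. This is a hedged sketch, not the asker's original code: ntoken() is the idiomatic quanteda counter, and Matrix::rowSums is shown as a sparse-aware fallback on the assumption that the dfm inherits from the Matrix package's sparse classes (slam::row_sums plays the same role for slam's triplet matrices).

```r
library(quanteda)

# Per-document token counts, computed on the sparse representation
toks <- ntoken(OutDfm)

# Sparse-aware fallback (assumption: a dfm is backed by a Matrix sparse class,
# so rowSums never creates a dense copy the way apply() does)
# toks2 <- Matrix::rowSums(OutDfm)

# Drop the empty documents
OutDfm <- dfm_subset(OutDfm, toks > 0)
```

Either way, the key point is to avoid apply(), which coerces the whole dfm to a dense matrix before summing.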

An example showing how this works, subsetting on ntoken(x) > 5000:

library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives : among vicissitudes
  1789-Washington               1  71 116      1  48     2               2 1     1            1

# subset based on the number of tokens
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
               features
docs            fellow-citizens  of the senate and house representatives : among vicissitudes
  1841-Harrison              11 604 829      5 231     1               4 1     3            0