How to remove zero entries in a DFM when the matrix is too big for usual manipulation?
I have the following problem: I convert a corpus into a dfm, and this dfm has some zero entries that I need to remove before fitting an LDA model. I usually do it like this:
OutDfm <- dfm_trim(
  dfm(corpus,
      tolower = TRUE,
      remove = c(stopwords("english"), stopwords("german"),
                 stopwords("french"), stopwords("italian")),
      remove_punct = TRUE,
      remove_numbers = TRUE,
      remove_separators = TRUE,
      stem = TRUE,
      verbose = TRUE),
  min_docfreq = 5
)
Creating a dfm from a corpus input...
... lowercasing
... found 272,912 documents, 112,588 features
... removed 613 features
... stemming features (English), trimmed 27491 feature variants
... created a 272,912 x 84,515 sparse dfm
... complete.
Elapsed time: 78.7 seconds.
# remove zero-entries
raw.sum <- apply(OutDfm, 1, FUN = sum)
which(raw.sum == 0)
OutDfm <- OutDfm[raw.sum != 0, ]
However, when I try to perform that last operation, I get:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
which suggests the matrix is too large to manipulate.
Has anyone run into this and solved it? Any alternative strategies for removing the zero entries?
Many thanks!
Your apply and sum calls convert the dfm from a sparse to a dense matrix in order to compute the row sums. Either use slam::row_sums, since the slam functions work on sparse matrices, or, better still, just use quanteda::dfm_subset to select all documents with more than 0 tokens.
dfm_subset(OutDfm, ntoken(OutDfm) > 0)
An example showing how to subset with ntoken > 5000:
library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1789-Washington 1 71 116 1 48 2 2 1 1 1
# subset based on amount of tokens.
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1841-Harrison 11 604 829 5 231 1 4 1 3 0
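The sparse-row-sums route mentioned above can also be sketched with the Matrix package, which quanteda's dfm is built on: Matrix::rowSums operates on the sparse storage directly, so it avoids the dense coercion that triggers the Cholmod error. A toy sparse matrix stands in for OutDfm here:

```r
library(Matrix)

# Toy 3 x 2 sparse matrix standing in for the dfm; row 2 is all zeros
m <- sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(2, 5), dims = c(3, 2))

# Matrix::rowSums works on the sparse representation directly,
# unlike base apply(), which coerces the matrix to dense first
raw.sum <- Matrix::rowSums(m)

# Keep only rows with a nonzero total; drop = FALSE preserves the matrix class
m.nonzero <- m[raw.sum != 0, , drop = FALSE]
nrow(m.nonzero)  # the all-zero row is dropped
```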