row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency

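My question is simple. I have a (binary) TDM, and I want to reduce the number of rows so that it only includes the terms that appear in at least two documents.

I thought these two methods would give the same result on a binary matrix: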

> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity           : 100%
Maximal term length: 154
Weighting          : binary (bin)

> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity           : 100%
Maximal term length: 308
Weighting          : binary (bin)

But they don't. Can you help me figure out why not?

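They produce exactly the same result; the mismatch is in your selection criteria. In the first part, rowTotals > 2 keeps all words with a frequency of 3 or higher, while in the second part, lowfreq = 2 keeps all words with a frequency of 2 or higher. If you make sure both selection criteria are the same (for "at least two documents", that means rowTotals >= 2 together with lowfreq = 2), you will see that they produce the same result. See the code example below, which also includes a speed comparison.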

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(crude)

# via row_totals
row_totals <- slam::row_sums(tdm)
dtm_via_rowtotals <- tdm[which(row_totals > 2), ]
dtm_via_rowtotals

<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity           : 82%
Maximal term length: 13
Weighting          : term frequency (tf)

# via findFreqTerms
freq_terms <- findFreqTerms(tdm, lowfreq = 3)
dtm_via_freq_terms <- tdm[freq_terms, ]
dtm_via_freq_terms

<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity           : 82%
Maximal term length: 13
Weighting          : term frequency (tf)

Are they the same?

all.equal(dtm_via_rowtotals, dtm_via_freq_terms)
[1] TRUE
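
For the original goal (terms that appear in at least two documents), both thresholds would need to be 2 rather than 3. A minimal base-R sketch of the off-by-one, assuming a made-up binary matrix m with invented term names, and using rowSums as a stand-in for slam::row_sums:

```r
# Toy binary term-document matrix (made-up data): rows are terms, columns are documents.
m <- matrix(c(1, 0, 0, 0,   # "oil"    appears in 1 document
              1, 1, 0, 0,   # "price"  appears in 2 documents
              1, 1, 1, 0,   # "barrel" appears in 3 documents
              1, 1, 1, 1),  # "opec"   appears in 4 documents
            nrow = 4, byrow = TRUE,
            dimnames = list(c("oil", "price", "barrel", "opec"),
                            paste0("doc", 1:4)))

# Number of documents each term appears in
doc_counts <- rowSums(m)

# "> 2" keeps terms found in 3 or more documents (matches lowfreq = 3)
more_than_2 <- names(doc_counts)[doc_counts > 2]

# ">= 2" keeps terms found in at least 2 documents (matches lowfreq = 2)
at_least_2 <- names(doc_counts)[doc_counts >= 2]

print(more_than_2)  # "barrel" "opec"
print(at_least_2)   # "price" "barrel" "opec"
```

On the real TDM, the equivalent pair would be which(row_totals >= 2) and findFreqTerms(tdm, lowfreq = 2).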

Speed:

microbenchmark::microbenchmark(row_totals = {rowtotals <- slam::row_sums(tdm); dtm_via_rowtotals <- tdm[which(rowtotals > 2),]},
                               freq_terms = {freq_terms <- findFreqTerms(tdm, lowfreq = 3); dtm_via_freq_terms <- tdm[freq_terms, ]},
                               times = 1000L)

Unit: milliseconds
       expr    min     lq     mean median      uq     max neval
 row_totals 1.5039 1.6347 1.885161 1.7106 1.86085  9.3405  1000
 freq_terms 1.5696 1.6895 2.039345 1.7760 1.93525 99.0942  1000

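Selecting via row_totals is slightly faster, but only because findFreqTerms itself uses row_sums to get its information and adds a few extra lines of code to check that you passed it a document-term or term-document matrix and that the frequencies you requested are actual numbers.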
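
To make that concrete, here is a rough sketch of the logic described above. It is not the tm source: find_freq_terms_sketch is a hypothetical name, it works on a plain matrix with terms in rows instead of a TermDocumentMatrix, and rowSums stands in for slam::row_sums.

```r
# Hypothetical stand-in for findFreqTerms, written for a plain matrix
# (terms in rows, documents in columns). Not the actual tm implementation.
find_freq_terms_sketch <- function(x, lowfreq = 0, highfreq = Inf) {
  # Argument checks: this is the kind of extra work that explains
  # the small overhead compared with calling row_sums directly.
  stopifnot(is.matrix(x), is.numeric(lowfreq), is.numeric(highfreq))

  # Row totals (tm uses slam::row_sums on the sparse matrix)
  rs <- rowSums(x)

  # Keep the names of terms whose total lies within the inclusive bounds
  names(rs)[rs >= lowfreq & rs <= highfreq]
}

# Made-up example data
m <- matrix(c(1, 0, 0,
              1, 1, 0,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), NULL))

find_freq_terms_sketch(m, lowfreq = 2)  # "b" "c"
```

Note that the lower bound is inclusive, which is why lowfreq = 3 lines up with row_totals > 2 for integer counts.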