row_sums vs findFreqTerms for subsetting a TermDocumentMatrix to include words with a given min frequency
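My question is simple. I have a (binary) TDM and I want to reduce the number of rows so that it only includes terms that appear in at least two documents.
I thought these two approaches would give the same result on a binary matrix: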
> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity : 100%
Maximal term length: 154
Weighting : binary (bin)
> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity : 100%
Maximal term length: 308
Weighting : binary (bin)
But they don't. Can you help me figure out why not?
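They do produce exactly the same result; the two snippets just don't use the same cut-off. In the second snippet, lowfreq = 2 keeps all terms with a frequency of 2 or higher, while in the first snippet rowTotals > 2 keeps only terms with a frequency of 3 or higher. Since you want terms that occur in at least two documents, the first snippet is the one to change (rowTotals >= 2). If you make sure both selection criteria are the same, you will see that they return the same result. See the code example below, followed by a speed comparison.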
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)
# via row_totals
row_totals <- slam::row_sums(tdm)
dtm_via_rowtotals <- tdm[which(row_totals > 2),]
<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity : 82%
Maximal term length: 13
Weighting : term frequency (tf)
# via findFreqTerms
freq_terms <- findFreqTerms(tdm, lowfreq = 3)
dtm_via_freq_terms <- tdm[freq_terms, ]
<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity : 82%
Maximal term length: 13
Weighting : term frequency (tf)
Are they the same?
all.equal(dtm_via_rowtotals, dtm_via_freq_terms)
[1] TRUE
Speed:
microbenchmark::microbenchmark(row_totals = {rowtotals <- slam::row_sums(tdm); dtm_via_rowtotals <- tdm[which(rowtotals > 2),]},
freq_terms = {freq_terms <- findFreqTerms(tdm, lowfreq = 3); dtm_via_freq_terms <- tdm[freq_terms, ]},
times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
row_totals 1.5039 1.6347 1.885161 1.7106 1.86085 9.3405 1000
freq_terms 1.5696 1.6895 2.039345 1.7760 1.93525 99.0942 1000
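Selecting via row_totals is slightly faster. But that is because findFreqTerms actually uses row_sums to get the information, and has a few extra lines of code to check that you passed it a document-term (or term-document) matrix and that the frequencies you requested are actual numbers.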
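To make the off-by-one visible without a large corpus, here is a minimal sketch in base R on a toy binary matrix. The helper find_freq_rows is hypothetical: it is not the tm source, only an imitation of the behaviour described above (row totals plus an inclusive lower bound and a couple of argument checks), and it uses rowSums on a plain matrix as a stand-in for slam::row_sums on a sparse one.

```r
# Toy binary term-document matrix: rows are terms, columns are documents.
m <- matrix(c(1, 0, 0, 0,
              1, 1, 0, 0,
              1, 1, 1, 0,
              1, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("once", "twice", "thrice", "always"),
                            paste0("doc", 1:4)))

# Hypothetical helper (an assumption, not tm's code): checks its arguments,
# sums the rows, and keeps names whose total lies in [lowfreq, highfreq].
find_freq_rows <- function(x, lowfreq = 0, highfreq = Inf) {
  stopifnot(is.matrix(x), is.numeric(lowfreq), is.numeric(highfreq))
  totals <- rowSums(x)
  names(totals[totals >= lowfreq & totals <= highfreq])
}

totals <- rowSums(m)

strict     <- names(which(totals > 2))   # strictly more than 2, i.e. 3 or more
inclusive  <- names(which(totals >= 2))  # at least 2
via_helper <- find_freq_rows(m, lowfreq = 2)

print(strict)
print(inclusive)
print(identical(inclusive, via_helper))
```

On this matrix the term "twice" is dropped by > 2 but kept by >= 2 and by the helper with lowfreq = 2, which is the same kind of gap as the different term counts in the question.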