DocumentTermMatrix in R computes IDF using log base 2
I am using the following R code to compute tf-idf:
library(tm)
library(SnowballC)
docs <- c(D1 = "The sky is blue", D2 = "The sun is bright", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs)) #Make a corpus object from a text vector
#Clean the text
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm);
The result I get is as follows:
    Terms
Docs      blue    bright       sky       sun
   1 0.7924813 0.0000000 0.2924813 0.0000000
   2 0.0000000 0.2924813 0.0000000 0.2924813
   3 0.0000000 0.1949875 0.1949875 0.1949875
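However, when I do the calculation by hand, the results do not match.
What I noticed is that R computes the IDF as log2(total number of documents / number of documents containing term t).
Is there a way in R to change the base of the logarithm from 2 to 10?
Please advise.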
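The mismatch can be reproduced in base R without tm. This is a minimal sketch, assuming the normalized weighting (tf is the term count divided by the number of terms left in the document) and idf = log(N / df); the variable names are illustrative only.

```r
# Hand calculation for document 1 ("sky blue" after stop-word removal).
# Assumption: tf = term count / number of terms in the document,
#             idf = log(N / df), with N = 3 documents.
n_docs <- 3
tf <- c(blue = 1 / 2, sky = 1 / 2)   # two terms remain in document 1
df <- c(blue = 1, sky = 2)           # number of documents containing each term

tfidf_log2  <- tf * log2(n_docs / df)   # base 2: matches the tm output above
tfidf_log10 <- tf * log10(n_docs / df)  # base 10: what a log10 hand calculation gives

print(round(tfidf_log2, 7))
print(round(tfidf_log10, 7))
```

The base-2 values agree with row 1 of the matrix above, while the base-10 values are smaller by a constant factor.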
Try writing your own weighting function, using log10 in place of log2:
weightTfIdf.log10 <- function (m, normalize = TRUE)
{
    # Work on a term-document layout internally; transpose back at the end
    isDTM <- inherits(m, "DocumentTermMatrix")
    if (isDTM)
        m <- t(m)
    if (normalize) {
        # Divide each count by the total number of terms in its document
        cs <- col_sums(m)
        if (any(cs == 0))
            warning("empty document(s): ",
                    paste(Docs(m)[cs == 0], collapse = " "))
        names(cs) <- seq_len(nDocs(m))
        m$v <- m$v/cs[m$j]
    }
    # Number of documents in which each term occurs
    rs <- row_sums(m > 0)
    if (any(rs == 0))
        warning("unreferenced term(s): ",
                paste(Terms(m)[rs == 0], collapse = " "))
    # Inverse document frequency, base 10 instead of base 2
    lnrs <- log10(nDocs(m)/rs)
    lnrs[!is.finite(lnrs)] <- 0
    m <- m * lnrs
    attr(m, "weighting") <- c(sprintf("%s%s", "term frequency - inverse document frequency",
                                      if (normalize) " (normalized)" else ""), "tf-idf")
    if (isDTM)
        t(m)
    else m
}
# Evaluate the function inside tm's namespace so that helpers such as
# col_sums() and row_sums() are found
environment(weightTfIdf.log10) <- environment(TermDocumentMatrix)
dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf.log10))
as.matrix(dtm)
#          Docs
# Terms              1          2          3
#   blue    0.23856063 0.00000000 0.00000000
#   bright  0.00000000 0.23856063 0.00000000
#   bright. 0.00000000 0.00000000 0.15904042
#   sky     0.08804563 0.00000000 0.05869709
#   sun     0.00000000 0.08804563 0.05869709
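As a cross-check, note that changing the logarithm base only rescales every weight by a constant, because log10(x) = log2(x) * log10(2). The following is a minimal base-R sketch of that relationship. The count matrix is typed by hand to mirror the tokenization in the output above (an assumption for illustration, not produced by tm), and `tfidf` is a hypothetical helper, not a tm function.

```r
# Term counts per document, typed by hand to mirror the output above
# (there, "bright." with the trailing period is its own term).
counts <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1,
    1, 0, 1,
    0, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = c("blue", "bright", "bright.", "sky", "sun"),
                  Docs = c("1", "2", "3"))
)

# Hypothetical helper: normalized tf-idf with a configurable log base.
tfidf <- function(counts, base = 2) {
  tf  <- sweep(counts, 2, colSums(counts), "/")         # normalize per document
  idf <- log(ncol(counts) / rowSums(counts > 0), base)  # one idf value per term
  tf * idf                                              # idf is recycled down the rows
}

w2  <- tfidf(counts, base = 2)
w10 <- tfidf(counts, base = 10)

# The two weightings differ only by the constant factor log10(2).
print(all.equal(w10, w2 * log10(2)))
print(round(w10, 8))
```

Under this assumption, a matrix that has already been weighted with log2 could also be converted by multiplying it by log10(2), so the custom weighting function is mainly a convenience.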