DocumentTermMatrix needs to have a term frequency weighting error
I'm trying to use LDA() from the topicmodels package on a fairly large dataset. After trying everything to fix the errors "In nr * nc : NAs produced by integer overflow" and "Each row of the input matrix needs to contain at least one non-zero entry", I ended up with this error:
ask <- read.csv('askreddit201508.csv', stringsAsFactors = FALSE)
myDtm <- create_matrix(as.vector(ask$title), language = "english", removeNumbers = TRUE,
                       stemWords = TRUE, weighting = weightTf)
myDtm2 <- removeSparseTerms(myDtm, 0.99999)
myDtm2 <- rollup(myDtm2, 2, na.rm = TRUE, FUN = sum)
rowTotals <- apply(myDtm2, 1, sum)
myDtm2 <- myDtm2[rowTotals > 0, ]
LDA2 <- LDA(myDtm2, 100)
Error in LDA(myDtm2, 100) :
The DocumentTermMatrix needs to have a term frequency weighting
Part of the problem is that you are weighting the document-term matrix by tf-idf, but LDA requires term counts. In addition, this method of removing sparse terms seems to be producing some documents from which every term has been removed.
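Both points can be checked directly (a minimal sketch, assuming `myDtm` and `myDtm2` are tm `DocumentTermMatrix` objects as built above and the slam package is installed): the matrix carries a weighting attribute that `LDA()` inspects, and sparse-aware row sums reveal any documents emptied by `removeSparseTerms()`.

```r
library(tm)    # DocumentTermMatrix class and weighting functions
library(slam)  # sparse-aware row sums for simple_triplet_matrix objects

# LDA() only accepts a plain term-frequency weighting; for a tf-idf matrix
# this attribute reads "term frequency - inverse document frequency"
attr(myDtm, "weighting")

# count how many documents were left with zero terms after sparse-term removal
sum(row_sums(myDtm2) == 0)
```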
It's much easier to go from text to a topic model using the quanteda package. Here's how:
require(quanteda)
myCorpus <- corpus(textfile("http://homepage.stat.uiowa.edu/~thanhtran/askreddit201508.csv",
textField = "title"))
myDfm <- dfm(myCorpus, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 160,707 documents
## ... indexing features: 39,505 feature types
## ... stemming features (English), trimmed 12563 feature variants
## ... created a 160707 x 26942 sparse dfm
## ... complete.
# remove infrequent terms: see http://stats.stackexchange.com/questions/160539/is-this-interpretation-of-sparsity-accurate/160599#160599
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.99999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
## Features occurring in fewer than 1.60707 documents: 12579
nfeature(myDfm2)
## [1] 14363
# fit the LDA model
require(topicmodels)
LDA2 <- LDA(quantedaformat2dtm(myDfm2), 100)
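The `quantedaformat2dtm()` helper comes from older quanteda releases; if it is missing from your installed version (an assumption about version differences, not something stated in the answer above), `convert()` provides the same bridge to topicmodels:

```r
# convert the dfm to the DocumentTermMatrix format expected by topicmodels
LDA2 <- LDA(convert(myDfm2, to = "topicmodels"), k = 100)
```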
Alternatively, build the DocumentTermMatrix with an explicit term-frequency weighting before fitting:

all.dtm <- DocumentTermMatrix(corpus,
                              control = list(weighting = weightTf))
inspect(all.dtm)
tpc.mdl.LDA <- LDA(all.dtm, k = the.number.of.topics)