LDA with tm package in R using bigrams
I have a csv in which each row is a document, and I need to run LDA on it. I have the following code:
library(tm)
library(SnowballC)
library(topicmodels)
library(RWeka)
X = read.csv('doc.csv',sep=",",quote="\"",stringsAsFactors=FALSE)
corpus <- Corpus(VectorSource(X))
corpus <- tm_map(tm_map(tm_map(corpus, stripWhitespace), tolower), stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize=BigramTokenizer,weighting=weightTfIdf))
Inspecting the dtm object at this point gives:
<<DocumentTermMatrix (documents: 52, terms: 477)>>
Non-/sparse entries: 492/24312
Sparsity : 98%
Maximal term length: 20
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
I then go on to run LDA on it:
rowTotals <- apply(dtm , 1, sum)
dtm.new <- dtm[rowTotals> 0, ]
g = LDA(dtm.new,10,method = 'VEM',control=NULL,model=NULL)
I get the following error:
Error in LDA(dtm.new, 10, method = "VEM", control = NULL, model = NULL) :
The DocumentTermMatrix needs to have a term frequency weighting
The document-term matrix is clearly weighted. What am I doing wrong?
Please help.
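The document-term matrix needs term frequency weighting. LDA works on raw term counts, so build the matrix with weightTf instead of weightTfIdf: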
dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = BigramTokenizer,
                                         weighting = weightTf))
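For reference, here is a minimal end-to-end sketch of the same pipeline with term frequency weighting. It rests on a few assumptions that are not in the question: the toy `docs` vector stands in for the rows of doc.csv, the bigram tokenizer is built on `NLP::ngrams` (the pattern shown in the tm FAQ) rather than RWeka so that no Java is needed, and `k = 2` with a fixed seed is chosen only to keep the example small.

```r
library(tm)
library(topicmodels)

# Hypothetical toy documents standing in for the rows of doc.csv.
docs <- c("the quick brown fox jumps over the lazy dog",
          "the lazy dog sleeps in the warm sun",
          "a quick brown fox runs through the forest",
          "the warm sun shines over the quiet forest")

# VCorpus is used here because SimpleCorpus ignores custom tokenizers.
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

# Bigram tokenizer based on NLP::ngrams; an assumed stand-in for
# RWeka's NGramTokenizer with min = 2, max = 2.
BigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)

# weightTf keeps raw counts, which is what LDA() checks for.
dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = BigramTokenizer,
                                         weighting = weightTf))

# Drop documents with no terms; LDA() rejects all-zero rows.
row_totals <- slam::row_sums(dtm)
dtm_new <- dtm[row_totals > 0, ]

fit <- LDA(dtm_new, k = 2, method = "VEM", control = list(seed = 1))
terms(fit, 3)  # top three bigrams per topic
```

If tf-idf is still wanted, one option is to use it only to select which terms to keep, and then fit LDA on the count matrix restricted to those terms.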