Problems with Naive Bayes

I am trying to run Naive Bayes in R to make predictions from text data (by building a document-term matrix).

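I read several posts warning that terms can be missing from either the training or the test set, so I decided to build everything from a single data frame and split it afterwards. This is the code I am using: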

data <- read.csv(file="path",header=TRUE)

########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)

# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])

# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)

# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)

# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)

# LOWERCASE, STEM WORDS, REMOVE STOPWORDS, PUNCTUATION AND NUMBERS, TRIM WHITESPACE
# content_transformer() keeps the documents as tm text documents,
# so no separate PlainTextDocument conversion is needed
completecorpus <- tm_map(completecorpus, content_transformer(tolower))
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords, stopwords("english"))
completecorpus <- tm_map(completecorpus, removePunctuation)
completecorpus <- tm_map(completecorpus, removeNumbers)
completecorpus <- tm_map(completecorpus, stripWhitespace)

# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]

# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)

# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))

conf.matrix

The problem is that I get strange results like this, with every test document predicted as class 1:

               actual
predicted    1   2   3
         1  60 833 107
         2   0   0   0
         3   0   0   0

Any idea why this happens?

The raw data looks like this:

head(complete)

      Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer.  easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer.  I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well 

      InfoType
13000        2
13001        2
13002        2
13003        3
13004        2
13005        2

It seems the problem was that the document-term matrix was far too sparse. So I added:

completematrix<-removeSparseTerms(completematrix, 0.95)

And it started working!
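As a rough illustration of what that line does: going by the tm documentation, removeSparseTerms(x, sparse) drops every term whose share of documents with a zero count is at least `sparse`, so a value of 0.95 keeps only terms that occur in more than about 5% of the documents. For the fix to have any effect, the call presumably has to run before completematrix is split into trainmatrix and testmatrix. The sketch below uses a made-up four-document corpus and a threshold of 0.6; both are assumptions chosen for illustration, not the data from the question.

```r
library(tm)

# Hypothetical toy documents (not the data from the question)
docs <- c("great cup great mixer",
          "great cup easy to clean",
          "great milkshakes",
          "cup cracked")

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# With 4 documents and sparse = 0.6, a term survives only if it
# appears in more than 4 * (1 - 0.6) = 1.6 documents, i.e. at least 2
dense <- removeSparseTerms(dtm, 0.6)

dim(dtm)      # documents x terms before pruning
Terms(dense)  # terms that survive the pruning
```

Only "cup" and "great" occur in at least two of the toy documents, so those are the terms expected to remain.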

             actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
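To put numbers on the two confusion matrices, overall accuracy can be computed as the diagonal sum divided by the total. The sketch below only re-enters the counts printed above; the helper name `accuracy` is made up for illustration.

```r
# Confusion matrices copied from the post (rows = predicted, columns = actual)
classes <- c("1", "2", "3")

before <- matrix(c(60, 833, 107,
                    0,   0,   0,
                    0,   0,   0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(predicted = classes, actual = classes))

after <- matrix(c(60, 511,  6,
                   0,  86,  2,
                   0, 236, 99),
                nrow = 3, byrow = TRUE,
                dimnames = list(predicted = classes, actual = classes))

# Hypothetical helper: share of test documents on the diagonal
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc_before <- accuracy(before)                 # 60 / 1000
acc_after  <- accuracy(after)                  # (60 + 86 + 99) / 1000
baseline   <- max(colSums(after)) / sum(after) # always guessing the largest class

c(before = acc_before, after = acc_after, baseline = baseline)
```

This gives 0.06 before and 0.245 after removing sparse terms. Since 833 of the 1000 test documents are class 2, always predicting class 2 would score 0.833, so the model now separates the classes but may still need further tuning.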

Thanks everyone for your ideas (thanks Chelsey Hill!!)
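A possible contributing factor, which the thread does not confirm: according to the e1071 documentation, naiveBayes() treats numeric predictors as Gaussian, and the laplace argument only smooths categorical predictors. Passing raw term counts through as.matrix() therefore gets no smoothing at all. A commonly used alternative is to recode each term as a present/absent factor before training. The sketch below shows that idea on invented toy counts; the data, the labels and the helper `to_presence` are all assumptions for illustration.

```r
library(e1071)

# Hypothetical toy term counts: rows are documents, columns are terms
counts <- matrix(c(2, 0, 0,
                   1, 0, 1,
                   0, 3, 0,
                   0, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("great", "broken", "cup")))
labels <- factor(c("pos", "pos", "neg", "neg"))

# Hypothetical helper: recode every count column as a No/Yes factor,
# with fixed levels so training and test data share the same coding
to_presence <- function(m) {
  as.data.frame(lapply(as.data.frame(m), function(col) {
    factor(col > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }))
}

# With factor predictors, laplace = 1 actually smooths the tables
model <- naiveBayes(to_presence(counts), labels, laplace = 1)

# Two unseen toy documents: one mentions "great", one mentions "broken"
new_counts <- matrix(c(1, 0, 0,
                       0, 2, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, colnames(counts)))
pred <- predict(model, to_presence(new_counts))
pred
```

On real data, the same recoding would be applied to as.matrix(trainmatrix) and as.matrix(testmatrix) in place of the toy matrices.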