Problem when training Naive Bayes model in R
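I am using the caret package (I don't have much experience with it) to train a Naive Bayes model on my data, as outlined in the R code below. I run into problems with the sentence column when executing "nb_model": it produces a series of error messages, which are: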
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in
predict.NaiveBayes(modelFit, newdata) :
Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in
NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) :
Could you please advise on how to adapt the R code below to overcome this issue?
Dataset used in the R code below
A brief example of the dataset (10 variables):
Over arrested at in | Negative | Negative | Neutral | Neutral | Neutral | Negative |
Positive | Neutral | Negative
library(caret)
# Loading dataset
setwd("directory/path")
TrainSet = read.csv("textsent.csv", header = FALSE)
# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]
# Declaring the trainControl function
train_ctrl = trainControl(
method = "cv", #Specifying Cross validation
number = 3, # Specifying 3-fold
)
nb_model = train(
V10 ~., # Specifying the response variable and the feature variables
method = "nb", # Specifying the model to use
data = train,
trControl = train_ctrl,
)
# Get the predictions of your model in the test set
predictions = predict(nb_model, newdata = test)
# See the confusion matrix of your model in the test set
confusionMatrix(predictions, test$V10)
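The dataset is all character data. Within it there is a mix of easily encoded words (V2 - V10) and sentences (V1), on which you could do any amount of feature engineering and generate any number of features.
To read up on text mining, check out the tm package, its docs, or blogs like hack-r.com for practical examples. Here's some Github code from the linked article.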
OK, so first I set stringsAsFactors = F because your V1 has tons of unique sentences:
TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
header = F,
stringsAsFactors = F)
library(caret)
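To see why the sentence column matters, here is a minimal sketch on a made-up data frame (the `toy` data and its sentences are hypothetical, not taken from the dataset). When a column of unique sentences is read as a factor, a formula such as `V10 ~ .` gets expanded into one dummy column per sentence, which is a plausible reason the cross-validation folds fail:

```r
# Hypothetical toy data: every sentence in V1 is unique, as in the real V1
toy <- data.frame(
  V1  = c("over arrested at in", "rain in london", "nice day out",
          "train was late", "lost my keys"),
  V10 = c("Negative", "Neutral", "Positive", "Negative", "Neutral"),
  stringsAsFactors = TRUE  # mimic the old read.csv() default
)

# One factor level per sentence
n_levels <- nlevels(toy$V1)

# A formula turns that factor into dummy columns:
# an intercept plus (number of levels - 1) indicators
design <- model.matrix(V10 ~ ., data = toy)
n_cols <- ncol(design)

print(n_levels)
print(n_cols)
```

With 1,500 mostly unique sentences, the same expansion would yield on the order of a thousand near-empty indicator columns, many of which appear in only one fold.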
Then I did the feature engineering:
## Feature Engineering
# V2 - V10
TrainSet[TrainSet=="Negative"] <- 0
TrainSet[TrainSet=="Positive"] <- 1
# V1 - not sure what you wanted to do with this
# but here's a simple example of what
# you could do
TrainSet$V1 <- grepl("london", TrainSet$V1) # tests if london is in the string
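As an aside, the recoding above leaves "Neutral" untouched, so V2 - V10 end up as character columns holding "0", "1" and "Neutral". An alternative sketch, assuming you would rather keep the original labels, is to convert the columns to factors with a fixed level set so that the training rows, the test rows, and every fold agree on the levels. The `toy` data frame and the `sentiment_levels` name are hypothetical:

```r
# Hypothetical toy columns standing in for V2 - V10
toy <- data.frame(
  V2  = c("Negative", "Neutral", "Positive", "Neutral"),
  V10 = c("Positive", "Negative", "Neutral", "Negative"),
  stringsAsFactors = FALSE
)

# Fix the level set once so every subset agrees on it
sentiment_levels <- c("Negative", "Neutral", "Positive")
toy[] <- lapply(toy, factor, levels = sentiment_levels)

# Even a subset that is missing a label keeps all three levels
test_part <- toy[1:2, ]
print(levels(test_part$V10))
```

Keeping identical levels on both sides should also help later, since confusionMatrix() generally expects the predictions and the reference to be factors with the same levels.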
Then it works, though you'd need to refine the engineering of V1 (or drop it) to get better results.
# In reality you could probably generate 20+ decent features from this text
# word count, tons of stuff... see the tm package
# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]
# Declaring the trainControl function
train_ctrl = trainControl(
  method = "cv", # Specifying Cross validation
  number = 3     # Specifying 3-fold
)
nb_model = train(
  V10 ~ .,       # Specifying the response variable and the feature variables
  method = "nb", # Specifying the model to use
  data = train,
  trControl = train_ctrl
)
# Resampling: Cross-Validated (3 fold)
# Summary of sample sizes: 799, 800, 801
# Resampling results across tuning parameters:
#
#   usekernel  Accuracy   Kappa
#   FALSE      0.6533444  0.4422346
#   TRUE       0.6633569  0.4185751
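You'll get some ignorable warnings in this basic example, because only a few sentences in V1 contain the word "london". I'd suggest using that column for things like sentiment analysis, term frequency / inverse document frequency, etc.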
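As a rough illustration of the kind of features mentioned above, here is a base R sketch that derives a word count, a character count, a keyword flag, and a single-term TF-IDF score. The `sentences` vector and all column names are hypothetical, and this is a simplified hand-rolled calculation, not what the tm package does internally:

```r
# Hypothetical sentences standing in for the V1 column
sentences <- c("Over arrested at in London",
               "Lovely day in the park",
               "London trains delayed again today")

# Lower-case and split on anything that is not a letter, dropping empty tokens
tokens <- strsplit(tolower(sentences), "[^a-z]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

features <- data.frame(
  n_words    = lengths(tokens),    # word count per sentence
  n_chars    = nchar(sentences),   # character count per sentence
  has_london = grepl("london", sentences, ignore.case = TRUE)
)

# Term frequency: share of a sentence's words that equal "london"
tf_london <- vapply(tokens, function(w) mean(w == "london"), numeric(1))

# Inverse document frequency: log(documents / documents containing the term)
idf_london <- log(length(sentences) / sum(features$has_london))

features$tfidf_london <- tf_london * idf_london
print(features)
```

Numeric columns like these could then replace the raw V1 text before calling train(). For a full vocabulary rather than one term, a document-term matrix from the tm package is probably the more practical route.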