R - Caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"
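I am trying to build a simple Naive Bayes classifier for mushroom data. I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.
I am using the caret package.
Here is my code in full: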
##################################################################################
# Prepare R and R Studio environment
##################################################################################
# Clear the R studio console
cat("\014")
# Remove objects from environment
rm(list = ls())
# Install and load packages if necessary
if (!require(tidyverse)) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require(caret)) {
install.packages("caret")
library(caret)
}
if (!require(klaR)) {
install.packages("klaR")
library(klaR)
}
#################################
mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)
na.omit(mushrooms)
names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")
# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'
set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)
train <- mushrooms[split, ]
test <- mushrooms[-split, ]
predictors <- names(train)[2:20] #Create response and predictor data
x <- train[,predictors] #predictors
y <- train$edibility #response
train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation
edibility_mod1 <- train( #train the model
x = x,
y = y,
method = "nb",
trControl = train_control
)
When I execute the train() function, I get the following output:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :2 NA's :2
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) :
Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
After the script runs, x and y look like this:
> str(x)
'data.frame': 6500 obs. of 19 variables:
$ capShape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ capSurface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap-color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill-attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill-spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill-size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill-color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk-shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk-root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-color-above-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk-color-below-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil-type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ veil-color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring-number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ ring-type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
> str(y)
Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
My environment is:
> R.version
_
platform x86_64-apple-darwin17.0
arch x86_64
os darwin17.0
system x86_64, darwin17.0
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
> RStudio.Version()
$citation
To cite RStudio in publications use:
RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, PBC},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
}
$mode
[1] "desktop"
$version
[1] ‘1.3.1093’
$release_name
[1] "Apricot Nasturtium"
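What you are trying to do is a bit tricky. Most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), use a normal distribution. You can see this under Details on the naiveBayes help page from e1071: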
The standard naive Bayes classifier (at least this implementation)
assumes independence of the predictor variables, and Gaussian
distribution (given the target class) of metric predictors. For
attributes with missing values, the corresponding table entries are
omitted for prediction.
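Your predictors are categorical, so this may be a problem. You can try setting usekernel=TRUE and adjust=1 to push it toward a normal fit, and avoid usekernel=FALSE, which throws the error.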
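To see which predictors are likely to trip up the fit, it can help to check each column for a single observed value and for names that are not syntactically valid in R. The following is only a base-R sketch on a small made-up data frame; the helper name `check_predictors` and the `toy` data are hypothetical and not part of caret or klaR:

```r
# Hypothetical helper: report columns that commonly cause trouble
check_predictors <- function(df) {
  # number of distinct observed values in each column
  n_unique <- vapply(df, function(col) length(unique(col)), integer(1))
  data.frame(
    column    = names(df),
    class     = vapply(df, function(col) class(col)[1], character(1)),
    n_unique  = n_unique,
    constant  = n_unique < 2,                        # only one value observed
    syntactic = names(df) == make.names(names(df)),  # FALSE for names like "veil-type"
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up stand-in for a few mushroom columns
toy <- data.frame(
  `veil-type` = factor(c("p", "p", "p", "p")),
  odor        = factor(c("a", "n", "a", "f")),
  bruises     = c(TRUE, FALSE, TRUE, TRUE),
  check.names = FALSE
)

report <- check_predictors(toy)
print(report)
```

On the real data the same call would be `check_predictors(x)`; columns flagged as constant or non-syntactic are the ones worth dropping or renaming.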
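Before fitting, we drop the column that has only 1 level and tidy up the column names. In this case it is also easier to use the formula interface and avoid making dummy variables: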
df = train
levels(df[["veil-type"]])
[1] "p"
df[["veil-type"]]=NULL
colnames(df) = gsub("-","_",colnames(df))
Grid = expand.grid(usekernel=TRUE,adjust=1,fL=c(0.2,0.5,0.8))
mod1 <- train(edibility~.,data=df,
method = "nb", trControl = trainControl(method="cv",number=5),
tuneGrid=Grid
)
mod1
Naive Bayes
6500 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200
Resampling results across tuning parameters:
fL Accuracy Kappa
0.2 0.9243077 0.8478624
0.5 0.9243077 0.8478624
0.8 0.9243077 0.8478624
Tuning parameter 'usekernel' was held constant at a value of TRUE
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
adjust = 1.
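The held-out `test` set needs the same preparation before prediction, otherwise its column names will again differ from the ones the model saw. Below is a minimal base-R sketch of one way to keep the two splits in step; the helper `prep_mushrooms` and the toy data frames are hypothetical stand-ins, not part of the code above:

```r
# Hypothetical helper: apply identical cleaning to any split
prep_mushrooms <- function(df, drop_cols) {
  # drop the columns chosen from the training data
  df <- df[, !(names(df) %in% drop_cols), drop = FALSE]
  # replace hyphens so the names are valid in a formula
  names(df) <- gsub("-", "_", names(df), fixed = TRUE)
  df
}

# Made-up stand-ins for the train and test splits
toy_train <- data.frame(
  edibility   = factor(c("e", "p", "e", "p")),
  `veil-type` = factor(c("p", "p", "p", "p")),
  `gill-size` = factor(c("b", "n", "b", "n")),
  check.names = FALSE
)
toy_test <- data.frame(
  edibility   = factor(c("p", "e")),
  `veil-type` = factor(c("p", "p")),
  `gill-size` = factor(c("n", "b")),
  check.names = FALSE
)

# Decide which columns to drop from the training data only
is_constant <- vapply(toy_train, function(col) length(unique(col)) < 2, logical(1))
drop_cols <- names(toy_train)[is_constant]

train_clean <- prep_mushrooms(toy_train, drop_cols)
test_clean  <- prep_mushrooms(toy_test, drop_cols)

# Both splits now expose the same column names
stopifnot(identical(names(train_clean), names(test_clean)))
```

With the real data, the cleaned test set could then be passed to caret's `predict()`, for example `predict(mod1, newdata = prep_mushrooms(test, "veil-type"))`, assuming `mod1` was trained on data cleaned the same way.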