R - Caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"

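I'm trying to build a simple Naive Bayes classifier for mushroom data. I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.

I am using the caret package.

Here is my full code: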

##################################################################################
# Prepare R and R Studio environment
##################################################################################

# Clear the R studio console
cat("\014")

# Remove objects from environment
rm(list = ls())

# Install and load packages if necessary
if (!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
if (!require(klaR)) {
  install.packages("klaR")
  library(klaR)
}

#################################

mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)

# Drop rows with NA values (the result must be assigned to take effect)
mushrooms <- na.omit(mushrooms)

names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")

# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'

set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)

train <- mushrooms[split, ]
test <- mushrooms[-split, ]

predictors <- names(train)[2:20] #Create response and predictor data

x <- train[,predictors] #predictors
y <- train$edibility #response

train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation

edibility_mod1 <- train( #train the model
  x = x,
  y = y,
  method = "nb", 
  trControl = train_control
)

When I run train(), I get the following output:

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :2     NA's   :2    
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) : 
  Not all variable names used in object found in newdata
 
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
 
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Here are x and y after running the script:

> str(x)
'data.frame':   6500 obs. of  19 variables:
 $ capShape                : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
 $ capSurface              : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
 $ cap-color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
 $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
 $ gill-attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill-spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill-size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill-color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
 $ stalk-shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk-root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
 $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk-color-below-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil-type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil-color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ ring-number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring-type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...



> str(y)
 Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...

My environment is:

> R.version
               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out     
> RStudio.Version()
$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{RStudio Team}},
    organization = {RStudio, PBC},
    address = {Boston, MA},
    year = {2020},
    url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.3.1093’

$release_name
[1] "Apricot Nasturtium"

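What you are trying to do is a bit tricky. Most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), assume a normal distribution. You can see this under Details on the naiveBayes help page from e1071: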

The standard naive Bayes classifier (at least this implementation) assumes independence of the predictor variables, and Gaussian distribution (given the target class) of metric predictors. For attributes with missing values, the corresponding table entries are omitted for prediction.
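To make the Gaussian assumption in that quote concrete, here is a minimal base R sketch of naive Bayes with normal class-conditional densities. It uses the built-in iris data (numeric predictors) rather than the mushroom data, and it is only an illustration of the idea, not how klaR or e1071 implement it internally. The helper name log_post is made up for this example.

```r
# Gaussian naive Bayes by hand on the built-in iris data (illustration only)
X <- iris[, 1:4]
y <- iris$Species

# Per-class mean and standard deviation of each numeric predictor
# (rows are classes, columns are predictors)
means <- sapply(X, function(col) tapply(col, y, mean))
sds   <- sapply(X, function(col) tapply(col, y, sd))

# Class priors as a plain named numeric vector
priors <- setNames(as.numeric(table(y)), levels(y)) / length(y)

# Log posterior (up to an additive constant) of each class for one observation:
# log prior plus the sum of independent Gaussian log densities
log_post <- function(obs) {
  sapply(levels(y), function(cl) {
    log(priors[[cl]]) +
      sum(dnorm(as.numeric(obs), mean = means[cl, ], sd = sds[cl, ], log = TRUE))
  })
}

# Score every row and pick the class with the highest log posterior
scores <- t(apply(X, 1, log_post))
pred <- factor(levels(y)[max.col(scores, ties.method = "first")], levels = levels(y))

mean(pred == y)  # training accuracy
```

The per-class mean and standard deviation only make sense for numeric columns, which is why a data set made up entirely of factors needs different handling.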

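Your predictors are categorical, so this can be problematic. You can try setting usekernel=TRUE and adjust=1 to force it towards normal, and avoid usekernel=FALSE, which throws the error.

Before that, we remove the columns with only one level and tidy up the column names. In this case it is also easier to use the formula interface and avoid making dummy variables: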

df = train 
levels(df[["veil-type"]])
[1] "p"
df[["veil-type"]]=NULL
colnames(df) = gsub("-","_",colnames(df))

Grid = expand.grid(usekernel=TRUE,adjust=1,fL=c(0.2,0.5,0.8))

mod1 <- train(edibility~.,data=df,
  method = "nb", trControl = trainControl(method="cv",number=5),
  tuneGrid=Grid
)

 mod1
Naive Bayes 

6500 samples
  21 predictor
   2 classes: 'e', 'p' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200 
Resampling results across tuning parameters:

  fL   Accuracy   Kappa    
  0.2  0.9243077  0.8478624
  0.5  0.9243077  0.8478624
  0.8  0.9243077  0.8478624

Tuning parameter 'usekernel' was held constant at a value of TRUE

Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
 adjust = 1.
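
If you would rather not hard-code "veil-type", the cleanup step above can be generalised. This is a rough base R sketch on a small made-up data frame (toy and its values are hypothetical stand-ins for the mushroom data); it drops every factor column with fewer than two levels and replaces hyphens in the column names:

```r
# Toy data frame standing in for the mushroom data (hypothetical values)
toy <- data.frame(
  edibility   = factor(c("e", "p", "e", "p")),
  `veil-type` = factor(c("p", "p", "p", "p")),
  `cap-color` = factor(c("n", "y", "w", "n")),
  check.names = FALSE  # keep the hyphenated names as they are
)

# Factor columns with fewer than two levels carry no information
single_level <- names(toy)[sapply(toy, function(col) is.factor(col) && nlevels(col) < 2)]
toy[single_level] <- NULL

# Replace hyphens so the names are syntactically valid in a formula
names(toy) <- gsub("-", "_", names(toy), fixed = TRUE)

names(toy)
```

Whatever cleanup you apply to the training data should be applied to the test data in the same way, so that the column names seen at prediction time match those used to fit the model.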