different caret/train errors when using oob and k-fold x-val with random forest

Here is the code I am using:

library(randomForest) # provides imports85 and rfImpute
library(caret)        # provides createDataPartition, upSample, trainControl, train

# data set for debugging in RStudio
data("imports85")
input<-imports85

# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop


# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]

# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
    rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
    rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}

# define response Y and predictors x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]

# upsample minority classes (classification only)
if (type=="class") {
    rf.train.upsampled <- upSample(x=x, y=Y)
}

# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # tune grid with 5 candidate values (1:5) for mtry
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)

The first error seems somewhat related to an existing question, but with caret/randomForest instead of lars, and I don't understand it: "Error in order(x[, 1]) : 'x' must be atomic for 'sort.list'. Have you called 'sort' on a list?" No, I did not call 'sort' on a list... at least not that I know of ;-)

I checked the documentation of caret / train, and it says x should be a data frame, which it is according to str(x).

If I use k-fold cross-validation instead of the oob error, like this:

cntrl<-trainControl(method = "repeatedcv", number=5, repeats = 2, p=0.9, sampling = "up", search='grid')

I get another interesting error: Can't have empty classes in y

Checking complete.cases(Y) seems to indicate that there are no empty classes, though...
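
For reference, a check along these lines (complete.cases() only flags NA values; table(Y) would show the per-class counts):

any(!complete.cases(Y))  # FALSE, so no NA values in the response
table(Y)                 # counts per class; a 0 here would be an "empty class"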

Does anyone have a hint for me?

Thanks, Mark

This is because of your dependent variable. You chose make. Have you checked that field? You have a training set and a test set; where do you put an outcome that has only one observation, such as make = "mercury"? How can you train on it? And if you don't train on it, how can you test it?

input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(count) %>% 
  print(n = 22)

# # A tibble: 22 × 2
#    make        count
#    <fct>       <int>
#  1 mercury         1
#  2 renault         2
#  3 alfa-romero     3
#  4 chevrolet       3
#  5 jaguar          3
#  6 isuzu           4
#  7 porsche         5
#  8 saab            6
#  9 audi            7
# 10 plymouth        7
# 11 bmw             8
# 12 mercedes-benz   8
# 13 dodge           9
# 14 peugot         11
# 15 volvo          11
# 16 subaru         12
# 17 volkswagen     12
# 18 honda          13
# 19 mitsubishi     13
# 20 mazda          17
# 21 nissan         18
# 22 toyota         32

You also get warnings when running createDataPartition(). I think the randomForest package wants at least five observations per group. You can filter for the groups you want to include and use that data for training and testing.

Before the comment labeled settings, you can add the following to subset the groups and verify the result.

filtGrps <- input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  filter(count >=5) %>% 
  select(make) %>% 
  unlist()

# filter for groups with sufficient observations for package
input <- input %>% 
  filter(make %in% filtGrps) %>% 
  droplevels() # then drop the empty levels

# check to see if it filtered as expected
input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)

This uses a cutoff of only 5, which is not ideal. (The more observations per class, the better.)

However, all of your code works with this filter applied.

rf
# Random Forest 
# 
# 147 samples
#  25 predictor
#  16 classes: 'audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo' 
# 
# No pre-processing
# Addtional sampling using up-sampling
# 
# Resampling results across tuning parameters:
# 
#   mtry  Accuracy   Kappa    
#   1     0.9505208  0.9472222
#   2     0.9869792  0.9861111
#   3     0.9869792  0.9861111
#   4     0.9895833  0.9888889
#   5     0.9921875  0.9916667
# 
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 5. 
rf$finalModel
# 
# Call:
#  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 5
# 
#         OOB estimate of  error rate: 0.52%
# Confusion matrix:
#               audi bmw dodge honda mazda mercedes-benz mitsubishi nissan peugot
# audi            24   0     0     0     0             0          0      0      0
# bmw              0  24     0     0     0             0          0      0      0
# dodge            0   0    24     0     0             0          0      0      0
# honda            0   0     0    24     0             0          0      0      0
# mazda            0   0     0     0    24             0          0      0      0
# mercedes-benz    0   0     0     0     0            24          0      0      0
# mitsubishi       0   0     0     0     0             0         24      0      0
# nissan           0   0     0     0     0             0          0     24      0
# peugot           0   0     0     0     0             0          0      0     24
# plymouth         0   0     0     0     0             0          0      0      0
# porsche          0   0     0     0     0             0          0      0      0
# saab             0   0     0     0     0             0          0      0      0
# subaru           0   0     0     0     0             0          0      0      0
# toyota           0   0     0     0     0             0          0      1      0
# volkswagen       0   0     0     0     0             0          0      0      0
# volvo            0   0     0     0     0             0          0      0      0
#               plymouth porsche saab subaru toyota volkswagen volvo class.error
# audi                 0       0    0      0      0          0     0  0.00000000
# bmw                  0       0    0      0      0          0     0  0.00000000
# dodge                0       0    0      0      0          0     0  0.00000000
# honda                0       0    0      0      0          0     0  0.00000000
# mazda                0       0    0      0      0          0     0  0.00000000
# mercedes-benz        0       0    0      0      0          0     0  0.00000000
# mitsubishi           0       0    0      0      0          0     0  0.00000000
# nissan               0       0    0      0      0          0     0  0.00000000
# peugot               0       0    0      0      0          0     0  0.00000000
# plymouth            24       0    0      0      0          0     0  0.00000000
# porsche              0      24    0      0      0          0     0  0.00000000
# saab                 0       0   24      0      0          0     0  0.00000000
# subaru               0       0    0     24      0          0     0  0.00000000
# toyota               0       0    0      0     22          0     1  0.08333333
# volkswagen           0       0    0      0      0         24     0  0.00000000
# volvo                0       0    0      0      0          0    24  0.00000000 

Of course, you will still want to test this model (a rough sketch of that step follows the complete listing below).

library(randomForest)
library(caret)
library(dplyr)

remove(list=ls())

# data set for debugging in RStudio
data("imports85")
input<-imports85

filtGrps <- input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  filter(count >=5) %>% 
  select(make) %>% 
  unlist()

# filter for groups with sufficient observations for package
input <- input %>% 
  filter(make %in% filtGrps) %>% 
  droplevels() # then drop the empty levels

# check to see if it filtered as expected
input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)

# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop


# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]

# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
    rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
    rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}

# define response Y and predictors x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]

# upsample minority classes (classification only)
if (type=="class") {
    rf.train.upsampled <- upSample(x=x, y=Y)
}

# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # tune grid with 5 candidate values (1:5) for mtry
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)
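
As for testing, a minimal sketch of how the hold-out evaluation could look, using the rf and rf.test objects from the listing above (dropping test rows with missing predictors instead of imputing them is a simplification on my part, not part of the original workflow):

# evaluate the tuned model on the hold-out test set
rf.test.clean <- rf.test[complete.cases(rf.test), ]   # drop rows with missing predictors

test.pred <- predict(rf, newdata = rf.test.clean[, names(rf.test.clean) != dependent])

confusionMatrix(data      = test.pred,
                reference = factor(rf.test.clean[, dependent],
                                   levels = levels(test.pred)))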