Different caret/train errors when using oob and k-fold cross-validation with random forest
This is the code I used:
library(caret)
library(randomForest)
# data set for debugging in RStudio
data("imports85")
input<-imports85
# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop
# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]
# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}
# define variables Y and dependent x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]
# upsample minority classes (classification only)
if (type=="class") {
rf.train.upsampled <- upSample(x=x, y=Y)
}
# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # create tune grid with mtry values 1:5 for tuning the model
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)
The first error is somewhat related to an existing question, but occurs with caret and randomForest instead of lars, and I don't understand it:

Error in order(x[, 1]) : 'x' must be atomic for 'sort.list'. Have you called 'sort' on a list?

No, I didn't call 'sort' on a list... at least not knowingly ;-)

I checked the caret/train documentation, which says that x should be a data frame, and it is one according to str(x).
If I use k-fold cross-validation instead of the oob error, like this:
cntrl<-trainControl(method = "repeatedcv", number=5, repeats = 2, p=0.9, sampling = "up", search='grid')
I get another interesting error:

Can not have empty classes in y

Checking complete.cases(Y) seems to indicate that there are no empty classes, but...
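(Note that complete.cases() only flags NA values; the "empty classes" caret complains about are factor levels of Y that occur zero times. A rough check for those would be:)

table(Y)                 # observation counts per factor level of Y
levels(Y)[table(Y) == 0] # factor levels with no observations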
Does anyone have a hint for me?

Thanks,
Mark
This is because of your dependent variable. You chose make. Have you checked that field? You have training and testing; where do you put an outcome that has only one observation, such as make = "mercury"? How can you train with that? And if you don't train on it, how can you test it?
input %>%
group_by(make) %>%
summarise(count = n()) %>%
arrange(count) %>%
print(n = 22)
# # A tibble: 22 × 2
# make count
# <fct> <int>
# 1 mercury 1
# 2 renault 2
# 3 alfa-romero 3
# 4 chevrolet 3
# 5 jaguar 3
# 6 isuzu 4
# 7 porsche 5
# 8 saab 6
# 9 audi 7
# 10 plymouth 7
# 11 bmw 8
# 12 mercedes-benz 8
# 13 dodge 9
# 14 peugot 11
# 15 volvo 11
# 16 subaru 12
# 17 volkswagen 12
# 18 honda 13
# 19 mitsubishi 13
# 20 mazda 17
# 21 nissan 18
# 22 toyota 32
You also got warnings when executing the function createDataPartition(). I think the randomForest package requires at least five observations per group. You can filter for the groups to include and use that data for testing and training.

Before the comment labeled settings, you could add the following to subset the groups and validate the result.
filtGrps <- input %>%
group_by(make) %>%
summarise(count = n()) %>%
filter(count >=5) %>%
select(make) %>%
unlist()
# filter for groups with sufficient observations for package
input <- input %>%
filter(make %in% filtGrps) %>%
droplevels() # then drop the empty levels
# check to see if it filtered as expected
input %>%
group_by(make) %>%
summarise(count = n()) %>%
arrange(-count) %>%
print(n = 16)
This keeps groups with as few as 5 observations, which is not ideal. (More is better.)
However, all of your code works with this filter.
rf
# Random Forest
#
# 147 samples
# 25 predictor
# 16 classes: 'audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'
#
# No pre-processing
# Addtional sampling using up-sampling
#
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 1 0.9505208 0.9472222
# 2 0.9869792 0.9861111
# 3 0.9869792 0.9861111
# 4 0.9895833 0.9888889
# 5 0.9921875 0.9916667
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 5.
rf$finalModel
#
# Call:
# randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 5
#
# OOB estimate of error rate: 0.52%
# Confusion matrix:
# audi bmw dodge honda mazda mercedes-benz mitsubishi nissan peugot
# audi 24 0 0 0 0 0 0 0 0
# bmw 0 24 0 0 0 0 0 0 0
# dodge 0 0 24 0 0 0 0 0 0
# honda 0 0 0 24 0 0 0 0 0
# mazda 0 0 0 0 24 0 0 0 0
# mercedes-benz 0 0 0 0 0 24 0 0 0
# mitsubishi 0 0 0 0 0 0 24 0 0
# nissan 0 0 0 0 0 0 0 24 0
# peugot 0 0 0 0 0 0 0 0 24
# plymouth 0 0 0 0 0 0 0 0 0
# porsche 0 0 0 0 0 0 0 0 0
# saab 0 0 0 0 0 0 0 0 0
# subaru 0 0 0 0 0 0 0 0 0
# toyota 0 0 0 0 0 0 0 1 0
# volkswagen 0 0 0 0 0 0 0 0 0
# volvo 0 0 0 0 0 0 0 0 0
# plymouth porsche saab subaru toyota volkswagen volvo class.error
# audi 0 0 0 0 0 0 0 0.00000000
# bmw 0 0 0 0 0 0 0 0.00000000
# dodge 0 0 0 0 0 0 0 0.00000000
# honda 0 0 0 0 0 0 0 0.00000000
# mazda 0 0 0 0 0 0 0 0.00000000
# mercedes-benz 0 0 0 0 0 0 0 0.00000000
# mitsubishi 0 0 0 0 0 0 0 0.00000000
# nissan 0 0 0 0 0 0 0 0.00000000
# peugot 0 0 0 0 0 0 0 0.00000000
# plymouth 24 0 0 0 0 0 0 0.00000000
# porsche 0 24 0 0 0 0 0 0.00000000
# saab 0 0 24 0 0 0 0 0.00000000
# subaru 0 0 0 24 0 0 0 0.00000000
# toyota 0 0 0 0 22 0 1 0.08333333
# volkswagen 0 0 0 0 0 24 0 0.00000000
# volvo 0 0 0 0 0 0 24 0.00000000
Of course, you'll still want to test this model. Here is the complete script; a sketch of one way to score the test set follows it.
library(randomForest)
library(caret)
library(dplyr)
remove(list=ls())
# data set for debugging in RStudio
data("imports85")
input<-imports85
filtGrps <- input %>%
group_by(make) %>%
summarise(count = n()) %>%
filter(count >=5) %>%
select(make) %>%
unlist()
# filter for groups with sufficient observations for package
input <- input %>%
filter(make %in% filtGrps) %>%
droplevels() # then drop the empty levels
# check to see if it filtered as expected
input %>%
group_by(make) %>%
summarise(count = n()) %>%
arrange(-count) %>%
print(n = 16)
# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop
# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]
# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}
# define variables Y and dependent x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]
# upsample minority classes (classification only)
if (type=="class") {
rf.train.upsampled <- upSample(x=x, y=Y)
}
# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) # create tune grid with mtry values 1:5 for tuning the model
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)
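For instance, a minimal sketch of scoring the held-out set (assuming rf.test is first reduced to complete cases, since predict() on a random forest will not accept missing predictor values):

# evaluate the tuned model on the held-out test set (illustrative sketch)
rf.test.clean <- rf.test[complete.cases(rf.test),]  # drop rows with missing values
pred <- predict(rf, newdata = rf.test.clean[, names(rf.test.clean) != dependent])
confusionMatrix(pred, rf.test.clean[, dependent])   # accuracy, kappa, per-class stats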