Random Forest - how can mtry be larger than the total number of independent variables?
1) I am trying to train a regression random forest on a dataset with 185 rows and 4 independent variables.
Two of the variables are categorical, with 3 and 13 levels respectively. The other 2 are continuous numeric variables.
I ran the RF with 10-fold cross-validation repeated 4 times. (I did not scale the dependent variable, which is why the RMSE is so large.)
My guess is that mtry can be larger than 4 because the categorical variables have 3 + 13 = 16 levels in total. But if that is the case, why does it not also count the 2 numeric variables?
185 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 168, 165, 166, 167, 166, 167, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 16764183 0.7843863 9267902
9 9451598 0.8615202 3977457
16 9639984 0.8586409 3813891
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.
Please help me understand mtry.
2) Also, the per-fold sample sizes are 168, 165, 166, ... Why do the sample sizes vary?
sample sizes: 168, 165, 166, 167, 166, 167
Thank you very much.
You are correct: there are 16 variables available for sampling, so the maximum value of mtry is 16.
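(Side note, not part of the original answer: one way to see where the 16 columns come from is to expand the formula yourself; train(y ~ ., ...) performs a similar dummy expansion internally. The data frame below is made up to mimic the question's predictors.)
# illustration only: a 3-level factor contributes 2 dummy columns, a 13-level
# factor contributes 12, plus the 2 numeric predictors = 16 columns
d <- data.frame(f3  = factor(sample(letters[1:3],  185, replace = TRUE), levels = letters[1:3]),
                f13 = factor(sample(letters[1:13], 185, replace = TRUE), levels = letters[1:13]),
                n1  = rnorm(185),
                n2  = runif(185))
ncol(model.matrix(~ ., d)) - 1   # 16 once the intercept column is dropped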
The values caret chooses are based on two parameters. In train there is a tuneLength option, which defaults to 3:
tuneLength = ifelse(trControl$method == "none", 1, 3)
That means three values are tested. For randomForest the tuning parameter is mtry, and its default grid is:
getModelInfo("rf")[[1]]$grid
function(x, y, len = NULL, search = "grid") {
  if(search == "grid") {
    out <- data.frame(mtry = caret::var_seq(p = ncol(x),
                                            classification = is.factor(y),
                                            len = len))
  } else {
    out <- data.frame(mtry = unique(sample(1:ncol(x), size = len, replace = TRUE)))
  }
  out
}
Since you have 16 columns, this becomes:
var_seq(16,len=3)
[1] 2 9 16
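(Also my addition, in case it is useful: you can preview the grid a larger tuneLength would produce without fitting anything, or simply pass tuneLength to train; df and trCtrl here refer to the example defined below.)
# the grid caret would build for tuneLength = 5 with 16 columns
caret::var_seq(p = 16, classification = FALSE, len = 5)
# expected: 2 5 9 12 16
# or let train build and evaluate it:
# mdl5 <- train(y ~ ., data = df, method = "rf", tuneLength = 5, trControl = trCtrl)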
You can test mtry values of your own choosing by supplying a tuning grid:
library(caret)
trCtrl = trainControl(method = "repeatedcv", repeats = 4, number = 10)
# test mtry = 2, 4, 6, ..., 16
trg = data.frame(mtry = seq(2, 16, by = 2))
# some random data for the example
df = data.frame(y = rnorm(200),
                x1 = sample(letters[1:13], 200, replace = TRUE),
                x2 = sample(LETTERS[1:3], 200, replace = TRUE),
                x3 = rpois(200, 10),
                x4 = runif(200))
# fit ("rf" is train()'s default method, spelled out here for clarity)
mdl = train(y ~ ., data = df, method = "rf", tuneGrid = trg, trControl = trCtrl)
Random Forest
200 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1.120216 0.04448700 0.8978851
4 1.157185 0.04424401 0.9275939
6 1.172316 0.04902991 0.9371778
8 1.186861 0.05276752 0.9485516
10 1.193595 0.05490291 0.9543479
12 1.200837 0.05608624 0.9574420
14 1.205663 0.05374614 0.9621094
16 1.210783 0.05537412 0.9665665
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
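Regarding the second question (my own note, not something the answer above covers): with 185 rows the 10 folds cannot all be the same size, and for a numeric outcome caret's createFolds also stratifies on quantile groups of y, so the training sets shown (165-168) correspond to held-out folds of 17-20 rows. A quick way to inspect the fold sizes, using a stand-in outcome:
library(caret)
set.seed(1)                               # arbitrary seed, illustration only
y <- rnorm(185)                           # hypothetical outcome with 185 rows
folds <- createFolds(y, k = 10, returnTrain = TRUE)
sapply(folds, length)                     # training-set sizes, similar to 165-168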