How to determine which fold was finally used as the test set in CV?
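In a 5-fold cross-validation with the mlr package, how can I determine which fold was finally used for testing and which for training? The elements $resampling$train.inds and $resampling$test.inds return all 5 folds, with no information about which one was ultimately used for training and which for testing.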
library("mlr")
# Regression task: predict horsepower (hp) from the other mtcars columns
regr_task = makeRegrTask(data = mtcars, target = "hp")
learner = makeLearner("regr.ranger",
                      importance = "impurity",
                      num.threads = 3)
# Hyperparameter search space
par_set = makeParamSet(
  makeIntegerParam("num.trees", lower = 100L, upper = 500L),
  makeIntegerParam("mtry", lower = 4L, upper = 8L)
)
# 5-fold CV; predict = "both" keeps predictions on the training and test sets
rdesc = makeResampleDesc("CV", iters = 5, predict = "both")
meas = rmse
ctrl = makeTuneControlGrid()
set.seed(1)
# Report RMSE on the test sets and, via train.mean, on the training sets
tuned_model = tuneParams(learner = learner,
                         task = regr_task,
                         resampling = rdesc,
                         measures = list(meas, setAggregation(meas, train.mean)),
                         par.set = par_set,
                         control = ctrl,
                         show.info = FALSE)
tuned_model
# Refit on the full task with the tuned hyperparameters
model_rf = setHyperPars(learner = learner, par.vals = tuned_model$x)
set.seed(1)
model_rf = train(learner = model_rf, task = regr_task)
model_rf
# Indices used for training and testing in each CV iteration
tuned_model$resampling$train.inds
tuned_model$resampling$test.inds
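You are mixing things up.
You are splitting the data into 5 folds. Each fold contains both training and test data.
That is why $resampling$train.inds and $resampling$test.inds each return a list of length 5. With 5 folds, you train on 4 partitions (roughly 80% of the data) and evaluate on 1 partition (roughly 20% of the data), and this is repeated once per fold.
The correct wording would be: "Which indices were used in which fold for training and testing?" The code below answers this question.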
> tuned_model$resampling$train.inds
[[1]]
[1] 10 32 6 15 20 28 26 12 8 24 31 27 22 2 13 29 17 11 1 3 16 18 21 19 9 5
[[2]]
[1] 10 6 15 28 26 12 23 30 8 25 24 7 31 27 14 2 13 29 17 1 16 4 21 19 9
[[3]]
[1] 10 32 20 26 12 23 30 8 25 7 27 22 14 2 13 29 17 11 1 3 16 18 4 19 5
[[4]]
[1] 32 6 15 20 28 26 12 23 30 25 24 7 31 22 14 13 17 11 1 3 18 4 21 19 9 5
[[5]]
[1] 10 32 6 15 20 28 23 30 8 25 24 7 31 27 22 14 2 29 11 3 16 18 4 21 9 5
> tuned_model$resampling$test.inds
[[1]]
[1] 4 7 14 23 25 30
[[2]]
[1] 3 5 11 18 20 22 32
[[3]]
[1] 6 9 15 21 24 28 31
[[4]]
[1] 2 8 10 16 27 29
[[5]]
[1] 1 12 13 17 19 26
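As a quick sanity check of the printed output, here is a minimal base-R sketch. It hard-codes the five test index vectors shown above (assuming the 32 rows of mtcars) and confirms that they partition the data: every row is held out exactly once, and the training set of each iteration is simply the complement. The names `test_inds`, `train_inds`, `n` and `fold_sizes` are illustrative and not part of mlr.

```r
# Test indices copied from the printed tuned_model$resampling$test.inds
test_inds <- list(
  c(4, 7, 14, 23, 25, 30),
  c(3, 5, 11, 18, 20, 22, 32),
  c(6, 9, 15, 21, 24, 28, 31),
  c(2, 8, 10, 16, 27, 29),
  c(1, 12, 13, 17, 19, 26)
)
n <- nrow(mtcars)  # mtcars ships with base R (datasets package)

# Training set of each iteration = all rows not held out in that iteration
train_inds <- lapply(test_inds, function(te) setdiff(seq_len(n), te))

# Every row appears in exactly one test set
all_test <- unlist(test_inds)
stopifnot(!anyDuplicated(all_test), setequal(all_test, seq_len(n)))

# Number of training and test rows per iteration
fold_sizes <- data.frame(
  iter  = seq_along(test_inds),
  train = lengths(train_inds),
  test  = lengths(test_inds)
)
print(fold_sizes)
```

So no single fold is "the" test fold: each of the five is held out once, and the tuning measure is aggregated over the five iterations. The model fitted afterwards with train() on regr_task uses all rows of the task.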