How do I find the best ntree and nodesize for randomForest in R, and then calculate the RMSE for a confusion table like the one below?

         Predicted
Actual   1  2  3
     1   4  3  1
     2   2  4  2
     3   3  2  1

(that is, there are 4 + 4 + 1 correct predictions)
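To make the arithmetic concrete, here is a minimal sketch in base R that rebuilds the table above and derives the number of correct predictions and the accuracy. The RMSE part rests on an assumption not stated in the question: that the class labels 1, 2, 3 are ordinal, so the distance between a predicted and an actual label is meaningful. For unordered classes this number should not be interpreted.

```r
# Confusion table from the question: rows = actual class, columns = predicted class
conf <- matrix(c(4, 3, 1,
                 2, 4, 2,
                 3, 2, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(Actual = as.character(1:3),
                               Predicted = as.character(1:3)))

n_total   <- sum(conf)        # 22 observations in total
n_correct <- sum(diag(conf))  # 4 + 4 + 1 = 9 correct predictions
accuracy  <- n_correct / n_total

# ASSUMPTION: labels are ordinal, so (actual - predicted)^2 is a sensible error.
# sq_dist[i, j] is the squared distance between actual class i and predicted class j.
sq_dist <- outer(1:3, 1:3, function(a, p) (a - p)^2)

# Weight each squared distance by how often that cell occurs, then take the root.
rmse <- sqrt(sum(conf * sq_dist) / n_total)

print(accuracy)
print(rmse)
```

If the classes have no natural order, accuracy (or a probability-based score) is the safer summary of this table.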

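  1. Yes, you can select the best parameters via k-fold cross-validation. I would recommend not tuning ntree and instead just setting it relatively high (1500-2000 trees), since adding more trees does not cause a random forest to overfit; that way you don't have to treat it as a tuning parameter at all. You can still go ahead and tune mtry.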

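  2. There are many ways to evaluate performance on a classification problem. If you are specifically interested in an RMSE-like measure, you can check out this CV post, which discusses the Brier Score - it is calculated much like RMSE, in that you take the predicted probabilities and the actual values and compute the mean squared error between them.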
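The two points above can be combined in a small sketch: a manual k-fold cross-validation that keeps ntree fixed at a high value, tunes only mtry, and scores each candidate with a multiclass Brier score. The iris data set, the 5 folds, the mtry grid and the helper name `brier_score` are all illustrative assumptions, not part of the original question.

```r
library(randomForest)

set.seed(42)
k <- 5
# Assign every row of iris to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))
mtry_grid <- 1:4  # iris has 4 predictors, so mtry can be 1..4

# Hypothetical helper: multiclass Brier score, i.e. the mean over observations
# of the squared differences between predicted probabilities and a one-hot
# encoding of the actual class (columns must follow the factor level order).
brier_score <- function(prob, actual) {
  onehot <- outer(as.integer(actual), seq_len(ncol(prob)), "==") * 1
  mean(rowSums((prob - onehot)^2))
}

# For each mtry, average the Brier score over the k held-out folds
cv_brier <- sapply(mtry_grid, function(m) {
  mean(sapply(seq_len(k), function(i) {
    train <- iris[folds != i, ]
    test  <- iris[folds == i, ]
    fit   <- randomForest(Species ~ ., data = train, ntree = 1500, mtry = m)
    prob  <- predict(fit, newdata = test, type = "prob")
    brier_score(prob, test$Species)
  }))
})
names(cv_brier) <- mtry_grid

best_mtry <- mtry_grid[which.min(cv_brier)]
print(cv_brier)
print(best_mtry)
```

Lower is better for the Brier score; taking `sqrt()` of it gives a value on the same scale as the predicted probabilities, which is the closest analogue to an RMSE here.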

You can use the mlr package to do all of the above. The tutorial has detailed sections on tuning and performance measurements. For tuning, you should use nested resampling.



library(mlr)

# define parameters we want to tune -- you may want to adjust the bounds
ps = makeParamSet(
  makeIntegerLearnerParam(id = "ntree", default = 500L, lower = 1L, upper = 1000L),
  makeIntegerLearnerParam(id = "nodesize", default = 1L, lower = 1L, upper = 50L)
)

# random sampling of the configuration space with at most 100 samples
ctrl = makeTuneControlRandom(maxit = 100L)

# do a nested 3 fold cross-validation
inner = makeResampleDesc("CV", iters = 3L)
learner = makeTuneWrapper("regr.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = rmse)

# outer resampling
outer = makeResampleDesc("CV", iters = 3)
# do the tuning, using the example boston housing task
res = resample(learner, bh.task, resampling = outer, extract = getTuneResult)

# show performance
print(performance(res$pred, measures = rmse))
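The example above is a regression task. Since the question is about classification, here is a hedged sketch of the same nested-resampling setup for a classifier, scored with mlr's `multiclass.brier` measure. The choice of `iris.task`, the parameter bounds, the fixed `ntree = 1500` and the small `maxit` are assumptions made to keep the example short, not recommendations.

```r
library(mlr)

set.seed(42)

# Classification learner that predicts probabilities (needed for the Brier score);
# ntree is fixed at a high value rather than tuned
base_lrn = makeLearner("classif.randomForest", predict.type = "prob", ntree = 1500L)

# Tune only mtry and nodesize -- bounds are illustrative
ps_cl = makeParamSet(
  makeIntegerParam("mtry", lower = 1L, upper = 4L),
  makeIntegerParam("nodesize", lower = 1L, upper = 20L)
)

# Random search with a small budget so the sketch runs quickly
ctrl_cl = makeTuneControlRandom(maxit = 5L)

# Inner loop selects hyperparameters, outer loop estimates performance
inner_cl = makeResampleDesc("CV", iters = 3L)
tuned_lrn = makeTuneWrapper(base_lrn, resampling = inner_cl, par.set = ps_cl,
                            control = ctrl_cl, show.info = FALSE,
                            measures = multiclass.brier)
outer_cl = makeResampleDesc("CV", iters = 3L)
res_cl = resample(tuned_lrn, iris.task, resampling = outer_cl,
                  measures = multiclass.brier, extract = getTuneResult,
                  show.info = FALSE)

# Aggregated outer-loop Brier score and the parameters chosen in each outer fold
print(res_cl$aggr)
print(getNestedTuneResultsX(res_cl))
```

The parameters picked can differ between outer folds; that is expected with nested resampling, and the aggregated score estimates the performance of the whole tuning procedure rather than of one fixed configuration.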
