How to find the best ntree and nodesize in randomForest in R, and then calculate RMSE for a confusion table as the result?

Question

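I have two questions about running randomForest in R.

1. How do I find the best values for two parameters, ntree and nodesize? Right now I just plug in arbitrary numbers, and sometimes I happen to get better results. Can I use some kind of k-fold cross-validation? If not, what method can I use to find these values?
2. After I run the randomForest function and have a model, I make predictions. From the predicted data I can build a confusion table like the following: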

            Predicted
             1  2  3
Actual   1   4  3  1
         2   2  4  2
         3   3  2  1

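(i.e., there are 4 + 4 + 1 correct predictions)

My question is: given this table, how can I calculate the RMSE (root mean squared error) of the predictions? Of course I could do it by hand, but I don't think that is the best answer.

Many thanks,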

Answer 1

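Yes, you can select the best parameters with k-fold cross-validation. I would recommend not tuning ntree and instead just setting it relatively high (1500-2000 trees), because adding more trees does not make a random forest overfit; that way you don't have to treat it as a tuning parameter. You can still go ahead and tune mtry.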
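As a rough illustration of the k-fold idea, here is a minimal sketch that compares a few nodesize values by cross-validated misclassification rate while keeping ntree fixed at a high value. The iris data, the 5 folds, and the candidate grid are all assumptions chosen for the example, not recommendations; the same loop could be extended to mtry.

```r
library(randomForest)

set.seed(42)                      # make the fold assignment reproducible
data(iris)                        # stand-in data set; replace with your own

k <- 5                            # number of folds (arbitrary choice)
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Candidate nodesize values to compare (hypothetical grid)
nodesize_grid <- c(1, 5, 10, 20)

# Mean misclassification rate across the k folds for one nodesize value
cv_error <- function(nodesize) {
  fold_err <- sapply(seq_len(k), function(i) {
    train <- iris[folds != i, ]
    test  <- iris[folds == i, ]
    fit   <- randomForest(Species ~ ., data = train,
                          ntree = 1500, nodesize = nodesize)
    mean(predict(fit, newdata = test) != test$Species)
  })
  mean(fold_err)
}

cv_results <- sapply(nodesize_grid, cv_error)
names(cv_results) <- nodesize_grid

# Pick the candidate with the lowest cross-validated error
best_nodesize <- nodesize_grid[which.min(cv_results)]
print(cv_results)
print(best_nodesize)
```

Note that the error of the winning candidate is an optimistic estimate of performance on new data, since the same folds were used to choose it.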
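There are many ways to evaluate performance on a classification problem. If you are specifically interested in an RMSE-like measure, you can look at this CV post, which discusses the Brier score. It is computed like RMSE, in that you use the predicted probabilities and the actual values to get a mean squared error.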
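To make this concrete, the sketch below computes a multiclass Brier score in base R from a matrix of predicted class probabilities. The probabilities and labels are made up for illustration; with randomForest they would typically come from predict(fit, newdata, type = "prob"). The second part is a separate idea, not the Brier score: if (and only if) the class labels 1, 2, 3 are ordered numeric values, an RMSE can be read directly off the confusion table from the question.

```r
# --- Part 1: multiclass Brier score from predicted probabilities ---

# Actual classes for six hypothetical observations
actual <- factor(c(1, 2, 3, 1, 2, 3), levels = 1:3)

# Hypothetical predicted class probabilities, one row per observation
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.2, 0.5, 0.3,
                 0.3, 0.3, 0.4,
                 0.4, 0.4, 0.2,
                 0.1, 0.8, 0.1,
                 0.5, 0.2, 0.3),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, levels(actual)))

# One-hot encode the actual classes: 1 in the true class column, 0 elsewhere
onehot <- outer(as.integer(actual), seq_len(nlevels(actual)), "==") * 1

# Squared differences summed over classes, then averaged over observations
brier <- mean(rowSums((prob - onehot)^2))
print(brier)

# --- Part 2: RMSE from the confusion table, ASSUMING ordinal numeric classes ---

conf <- matrix(c(4, 3, 1,
                 2, 4, 2,
                 3, 2, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(Actual = as.character(1:3),
                               Predicted = as.character(1:3)))

# Squared distance between the actual (row) and predicted (column) class value
sq_diff <- outer(1:3, 1:3, function(a, p) (a - p)^2)

# Weight each squared distance by its cell count, average, take the root
rmse_ordinal <- sqrt(sum(conf * sq_diff) / sum(conf))
print(rmse_ordinal)
```

For the table in the question, the weighted squared errors sum to 25 over 22 observations, so the second value is sqrt(25/22), about 1.07. If the classes are unordered categories, that number has no meaningful interpretation and a probability-based score such as the Brier score is the better fit.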

Answer 2

You can use the mlr package to do all of the above. The tutorial has detailed sections on tuning and performance measurement. For tuning, you should use nested resampling.

Assuming you have a regression task, it would look something like this:

library(mlr)

# define parameters we want to tune -- you may want to adjust the bounds
ps = makeParamSet(
  makeIntegerLearnerParam(id = "ntree", default = 500L, lower = 1L, upper = 1000L),
  makeIntegerLearnerParam(id = "nodesize", default = 1L, lower = 1L, upper = 50L)
)

# random sampling of the configuration space with at most 100 samples
ctrl = makeTuneControlRandom(maxit = 100L)

# do a nested 3 fold cross-validation
inner = makeResampleDesc("CV", iters = 3L)
learner = makeTuneWrapper("regr.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = rmse)

# outer resampling
outer = makeResampleDesc("CV", iters = 3)
# do the tuning, using the example boston housing task
res = resample(learner, bh.task, resampling = outer, extract = getTuneResult)

# show performance
print(performance(res$pred, measures = rmse))

The whole process looks very similar for classification; see the relevant tutorial pages for details.
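For reference, a classification variant might look roughly like the sketch below. It is an assumption-laden adaptation of the regression example, not code from the tutorial: it uses the built-in iris.task, tunes nodesize and mtry (with bounds picked arbitrarily), and measures the mean misclassification error (mmce) instead of RMSE.

```r
library(mlr)

# parameters to tune -- the bounds are illustrative only
ps = makeParamSet(
  makeIntegerParam("nodesize", lower = 1L, upper = 50L),
  makeIntegerParam("mtry", lower = 1L, upper = 4L)
)

# random search with a small budget to keep the example quick
ctrl = makeTuneControlRandom(maxit = 20L)

# inner resampling: used to pick the hyperparameters
inner = makeResampleDesc("CV", iters = 3L)
learner = makeTuneWrapper("classif.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = mmce)

# outer resampling: used to estimate the performance of the tuned learner
outer = makeResampleDesc("CV", iters = 3L)
res = resample(learner, iris.task, resampling = outer, measures = mmce,
               extract = getTuneResult, show.info = FALSE)

# aggregated misclassification error over the outer folds
print(res$aggr)
```

The tuned settings chosen in each outer fold are stored in res$extract.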

r

random-forest