Confusing confusion matrix parameters changing output
I have run a random forest model to make predictions. When I run the code below, I get two different confusion matrices. The only difference is that in one call I use data = train in the predict function, and in the other I just pass train directly. Why does this make such a big difference? One of them has much worse recall.
conf.matrix <- table(train$Status, predict(fit2, train))

                Pred:Churn  Pred:Current
Actual:Churn          2543           984
Actual:Current          44         27206

conf.matrix <- table(train$Status, predict(fit2, data = train))

                Pred:Churn  Pred:Current
Actual:Churn          1609          1918
Actual:Current         464         26786
Many thanks.
The data argument in your second example is ignored, because the correct argument name is newdata, as @mtoto and @agenis have pointed out. Without newdata, predict.randomForest returns the model's out-of-bag (OOB) predictions, which is what you want here.
From a post on CrossValidated:
Be aware that there's a difference between
predict(model)
and
predict(model, newdata=train)
when getting predictions for the training dataset. The first option gets the out-of-bag predictions from the random forest. This is generally what you want, when comparing predicted values to actuals on the training data.
The second treats your training data as if it was a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.
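To make the difference concrete, here is a minimal sketch of the two calls side by side. It assumes, as in the question, that fit2 is a fitted randomForest model and that train contains the Status response column:

```r
library(randomForest)

## Out-of-bag predictions: each training row is predicted only by the
## trees that did NOT see it during fitting. Note that
## predict(fit2, data = train) ends up identical to this, because
## predict.randomForest has no 'data' argument and silently ignores it.
oob.pred <- predict(fit2)

## "New data" predictions: every training row is run down every tree,
## including the trees it was used to grow, which looks artificially good.
refit.pred <- predict(fit2, newdata = train)

table(train$Status, oob.pred)    # honest error estimate on training data
table(train$Status, refit.pred)  # optimistically low error
```

So the first confusion matrix in the question (passing train positionally, i.e. as newdata) is the over-optimistic one, and the second (where data = train was ignored) is the honest OOB estimate.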