Confusing confusion matrix parameters changing output
I have run a random forest model to make predictions. When I run the code below, I get two different confusion matrices. The only difference is that in one call I use data = train in the predict function, and in the other I just pass train directly. Why does this make such a big difference? One of them has much worse recall.
conf.matrix <- table(train$Status, predict(fit2, train))

                Pred:Churn  Pred:Current
Actual:Churn          2543           984
Actual:Current          44         27206

conf.matrix <- table(train$Status, predict(fit2, data = train))

                Pred:Churn  Pred:Current
Actual:Churn          1609          1918
Actual:Current         464         26786
Many thanks.
The data argument in your second example is ignored, because the correct argument name is newdata, as @mtoto and @agenis have pointed out. Without newdata, predict.randomForest returns the model's out-of-bag (OOB) predictions, which is what you want here.
From a post on CrossValidated:
Be aware that there's a difference between
predict(model)
and
predict(model, newdata=train)
when getting predictions for the training dataset. The first option gets the out-of-bag predictions from the random forest. This is generally what you want, when comparing predicted values to actuals on the training data.
The second treats your training data as if it was a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.
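To make the difference concrete, here is a minimal sketch of the two calls side by side. It assumes, as in the question, that fit2 is a fitted randomForest model and that train contains the Status response column:

```r
library(randomForest)

## Out-of-bag predictions: each training row is predicted only by the
## trees that did NOT see it during fitting. Note that
## predict(fit2, data = train) ends up identical to this, because
## predict.randomForest has no 'data' argument and silently ignores it.
oob.pred <- predict(fit2)

## "New data" predictions: every training row is run down every tree,
## including the trees it was used to grow, which looks artificially good.
refit.pred <- predict(fit2, newdata = train)

table(train$Status, oob.pred)    # honest error estimate on training data
table(train$Status, refit.pred)  # optimistically low error
```

So the first confusion matrix in the question (passing train positionally, i.e. as newdata) is the over-optimistic one, and the second (where data = train was ignored) is the honest OOB estimate.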