Two different ways of calculating the training-set AUC of a random forest give me different results?

Question

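I used two methods to calculate the training-set AUC of a randomForest model, and the results are very different. The model and the two methods are as follows: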

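```r
library(randomForest)
library(ROCR)  # provides prediction() and performance() used below

rfmodel <- randomForest(y ~ ., data = train, importance = TRUE, ntree = 1000)
```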

Method 1 for the training-set AUC:

```r
rf_p_train <- predict(rfmodel, type = "prob", newdata = train)[, "yes"]
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
```

Method 2 for the training-set AUC:

```r
rf_p_train <- as.vector(rfmodel$votes[, 2])
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
```

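Method 1 gives me an AUC of about 1, but method 2 gives me an AUC of about 0.65. Why do the two results differ so much? Can anyone help me with this? I would really appreciate it. I'm sorry that I can't share the data here. This is my first question on this site, so please excuse anything that is unclear. Many thanks!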

Answer 1

I'm not sure what data you are using. It is best to provide a reproducible example, but I think I was able to piece one together:

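```r
library(randomForest)
# install.packages("ModelMetrics")
library(ModelMetrics)

set.seed(42)  # random forests are stochastic; fix the seed for reproducibility

# Reduce iris to a binary outcome (versicolor vs. virginica)
train <- iris[iris$Species %in% c("virginica", "versicolor"), ]
train$Species <- droplevels(train$Species)

# Build the model
rfmodel <- randomForest(Species ~ ., data = train, importance = TRUE, ntree = 2)

# Predicted probability of the second class level ("virginica") on the training rows
preds <- predict(rfmodel, type = "prob", newdata = train)[, 2]

# Encode the outcome as 0/1 explicitly so both metrics treat "virginica" as the positive class
actual <- as.integer(train$Species == "virginica")

# Calculate AUC
auc(actual, preds)

# Calculate LogLoss
logLoss(actual, preds)
```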
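As a cross-check on the `auc` value above, AUC can also be computed directly from its definition: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as one half. The base-R sketch below is only an illustration; `auc_pairwise` and the toy vectors are hypothetical names and data, not part of ModelMetrics or of the original answer.

```r
# AUC from its pairwise definition (hypothetical helper, base R only).
# `actual` must be coded 0 (negative) / 1 (positive); `score` is the predicted probability.
auc_pairwise <- function(actual, score) {
  pos <- score[actual == 1]
  neg <- score[actual == 0]
  # Compare every positive score with every negative score; a tie counts as half a win
  cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(cmp)
}

# Toy data: positives score 0.35 and 0.8, negatives score 0.1 and 0.4
actual <- c(0, 0, 1, 1)
score  <- c(0.1, 0.4, 0.35, 0.8)

toy_auc <- auc_pairwise(actual, score)
toy_auc  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

The pairwise comparison is quadratic in the number of rows, so it suits small checks like this one rather than large data sets.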

Answer 2

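OK. The second way is the correct one. Why? In the first way you treat the training data as if it were a new data set and predict on it again, so every tree votes on every row, including the trees that were trained on that row. That is why the AUC comes out close to 1. The second way uses `rfmodel$votes`, which holds the so-called out-of-bag (OOB) estimates: each row is scored only by the trees that did not see it during training. That is how the AUC should be calculated on the training data.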
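The difference can be reproduced on public data. The sketch below is a minimal illustration under stated assumptions, not the asker's setup: it uses the two-class iris subset restricted to the two sepal measurements (chosen here only so the classes overlap and the gap is visible), and a small rank-based helper, `auc_rank`, which is a hypothetical name introduced for this example. Each tree is grown on a bootstrap sample containing roughly 63% of the distinct training rows, so re-predicting the training set lets most trees vote on rows they have already seen.

```r
library(randomForest)

# Rank-based (Mann-Whitney) AUC; hypothetical helper, `actual` coded 0/1
auc_rank <- function(actual, score) {
  r <- rank(score)  # average ranks, so tied scores are handled
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)

# Two-class subset of iris with only the sepal measurements as predictors
train <- iris[iris$Species != "setosa", c("Sepal.Length", "Sepal.Width", "Species")]
train$Species <- droplevels(train$Species)
actual <- as.integer(train$Species == "virginica")

rfmodel <- randomForest(Species ~ ., data = train, ntree = 500)

# Method 1: re-predict the training rows; all trees vote, including those trained on the row
p_resub <- predict(rfmodel, newdata = train, type = "prob")[, "virginica"]

# Method 2: out-of-bag votes; each row is scored only by trees that did not train on it
p_oob <- rfmodel$votes[, "virginica"]

auc_resub <- auc_rank(actual, p_resub)
auc_oob   <- auc_rank(actual, p_oob)
c(resubstitution = auc_resub, oob = auc_oob)
```

On data like this the resubstitution AUC is typically close to 1 while the OOB AUC is noticeably lower, which mirrors the gap described in the question. Calling `predict(rfmodel, type = "prob")` without `newdata` also returns the out-of-bag predictions, so it can be used in place of `rfmodel$votes`.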


Tags: r, prediction, random-forest, auc