Difference between varImp (caret) and importance (randomForest) for Random Forest
I don't understand what the difference is between the varImp function (caret package) and the importance function (randomForest package) for a random forest model:
I computed a simple RF classification model, and when computing variable importance I found that the "ranking" of predictors differs between the two functions:
Here is my code:
rfImp <- randomForest(Origin ~ ., data = TAll_CS,
                      ntree = 2000,
                      importance = TRUE)
importance(rfImp)
                            BREAST      LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3   -1.44116806 2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3 -2.61146974 1.5848150           -0.4455327        0.2446930
Entropy_GLCM_R1SC4NG3  -3.42017102 3.8839464            0.9779201        0.4170445
...
varImp(rfImp)
                           BREAST        LUNG
Energy_GLCM_R1SC4NG3   0.72534283  0.72534283
Contrast_GLCM_R1SC4NG3 -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3   0.23188771  0.23188771
...
I thought they used the same "algorithm", but now I'm not sure.
EDIT
To reproduce the problem, the ionosphere dataset (kknn package) can be used:
library(kknn)
data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
                      ntree = 2000,
                      importance = TRUE)
importance(rfImp)
              b        g MeanDecreaseAccuracy MeanDecreaseGini
V3   21.3106205 42.23040             42.16524        15.770711
V4   10.9819574 28.55418             29.28955         6.431929
V5   30.8473944 44.99180             46.64411        22.868543
V6   11.1880372 33.01009             33.18346         6.999027
V7   13.3511887 32.22212             32.66688        14.100210
V8   11.8883317 32.41844             33.03005         7.243705
V9   -0.5020035 19.69505             19.54399         2.501567
V10  -2.9051578 22.24136             20.91442         2.953552
V11  -3.9585608 14.68528             14.11102         1.217768
V12   0.8254453 21.17199             20.75337         3.298964
...
varImp(rfImp)
            b         g
V3  31.770511 31.770511
V4  19.768070 19.768070
V5  37.919596 37.919596
V6  22.099063 22.099063
V7  22.786656 22.786656
V8  22.153388 22.153388
V9   9.596522  9.596522
V10  9.668101  9.668101
V11  5.363359  5.363359
V12 10.998718 10.998718
...
I think I'm missing something...
EDIT 2
I found that if you average each row over the first two columns of importance(rfImp), you get the result of varImp(rfImp):
impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, function(x) mean(x))
       V3        V4        V5        V6        V7        V8        V9
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388  9.596522
      V10       V11       V12
 9.668101  5.363359 10.998718 ...
# Same result as in both columns of varImp(rfImp)
I don't know why this happens, but there has to be an explanation.
I don't have your exact data, but using dummy data (see below) I cannot reproduce this behaviour. Maybe double-check that you really didn't do anything else that could affect your results. Which versions of R and caret are you using?
library(caret)
library(randomForest)
# classification - same result
rfImp1 <- randomForest(Species ~ ., data = iris[,1:5],
                       ntree = 2000,
                       importance = TRUE)
importance(rfImp1)
varImp(rfImp1)
# regression - same result
rfImp2 <- randomForest(Sepal.Length ~ ., data = iris[,1:4],
                       ntree = 2000,
                       importance = TRUE)
importance(rfImp2)
varImp(rfImp2)
UPDATE:
Using the Ionosphere data, this is reproducible:
library(caret)
library(randomForest)
library(mlbench)
data(Ionosphere)
str(Ionosphere)
rfImp1 <- randomForest(Class ~ ., data = Ionosphere[,3:35], ntree = 2000, importance = TRUE)
...yields these results:
> head(importance(rfImp1))
         bad     good MeanDecreaseAccuracy MeanDecreaseGini
V3 20.545836 41.43872             41.26313        15.308791
V4 10.615291 29.31543             29.58395         6.226591
V5 29.508581 44.86784             46.79365        21.757928
V6  9.231544 31.77881             31.48614         7.201694
V7 12.461476 34.39334             34.92728        14.802564
V8 12.944721 32.49392             33.35699         6.971502
> head(varImp(rfImp1))
        bad     good
V3 30.99228 30.99228
V4 19.96536 19.96536
V5 37.18821 37.18821
V6 20.50518 20.50518
V7 23.42741 23.42741
V8 22.71932 22.71932
My guess would be that caret and randomForest just use different ways of aggregating the results of the different runs for each variable - but @topepo will most likely give you an exact answer now.
If we walk through the method for varImp:
Check the object:
> getFromNamespace('varImp','caret')
function (object, ...)
{
    UseMethod("varImp")
}
Get the S3 method:
> getS3method('varImp','randomForest')
function (object, ...)
{
    code <- varImpDependencies("rf")
    code$varImp(object, ...)
}
<environment: namespace:caret>
code <- caret:::varImpDependencies('rf')
> code$varImp
function(object, ...){
    varImp <- randomForest::importance(object, ...)
    if(object$type == "regression")
        varImp <- data.frame(Overall = varImp[,"%IncMSE"])
    else {
        retainNames <- levels(object$y)
        if(all(retainNames %in% colnames(varImp))) {
            varImp <- varImp[, retainNames]
        } else {
            varImp <- data.frame(Overall = varImp[,1])
        }
    }
    out <- as.data.frame(varImp)
    if(dim(out)[2] == 2) {
        tmp <- apply(out, 1, mean)
        out[,1] <- out[,2] <- tmp
    }
    out
}
So this is not strictly returning randomForest::importance; it calculates that first, and then keeps only the columns for the class levels present in the dataset.
Then it does something interesting: it checks whether we have only two columns:
if(dim(out)[2] == 2) {
    tmp <- apply(out, 1, mean)
    out[,1] <- out[,2] <- tmp
}
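To make this concrete, here is a minimal check (an illustrative sketch, assuming the rfImp1 fit on the Ionosphere data from the previous answer) that rebuilds varImp's output by hand from randomForest::importance():
imp <- randomForest::importance(rfImp1)[, levels(rfImp1$y)]  # keep only the class columns
manual <- as.data.frame(imp)
manual[, 1] <- manual[, 2] <- rowMeans(imp)                  # the two-column averaging step
all.equal(manual, varImp(rfImp1))                            # should be TRUE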
According to the varImp man page:
Random Forest: varImp.randomForest and varImp.RandomForest are
wrappers around the importance functions from the randomForest and
party packages, respectively.
This is clearly not the case.
As for why...
If we have only two values, the importance of the variable as a predictor can be represented as a single value: if a variable is a predictor of g, then it must also be a predictor of b.
That does make sense, but it doesn't match their documentation of what the function does, so I would probably report this as unexpected behaviour. The function is trying to be helpful when you would expect to do the relative calculation yourself.
This answer is intended as an addition to @Shape's solution. I think that importance follows Breiman's well-known approach to calculate the variable importance reported as MeanDecreaseAccuracy: for the out-of-bag sample of each tree, compute the accuracy of the tree, then permute the variables one after the other and measure the accuracy after each permutation, to obtain the decrease in accuracy caused by losing that variable.
I have not been able to find much information on how exactly the class-specific decreases in accuracy in the first columns are calculated, but I assume it is (correctly predicted class k) / (total predicted class k).
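For intuition, here is a rough sketch of the permutation idea (a deliberate simplification, not the exact algorithm: it permutes once on the full training data, while Breiman's method works tree by tree on the out-of-bag rows only; iris and Petal.Length are merely illustrative choices):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
# Baseline accuracy with the variable intact
baseline <- mean(predict(rf, iris) == iris$Species)
# Permute one predictor to break its association with the response
permuted <- iris
permuted$Petal.Length <- sample(permuted$Petal.Length)
perm_acc <- mean(predict(rf, permuted) == iris$Species)
baseline - perm_acc  # decrease in accuracy attributed to Petal.Length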
As @Shape explains, varImp does not report the MeanDecreaseAccuracy reported by importance; instead it calculates the mean of the (scaled) class-specific decreases in accuracy and reports it for each of the classes. (For more than 2 classes, varImp reports only the class-specific decreases in accuracy.)
This approach is similar only when the class distribution is equal, because only in the balanced case does a decrease in the accuracy of one class decrease the accuracy of the other class by the same amount.
library(caret)
library(randomForest)
library(mlbench)
### Unbalanced sample size ###
data(Ionosphere)
rfImp1 <- randomForest(Class ~ ., data = Ionosphere[,3:35], ntree = 1000, importance = TRUE)
# How importance() calculates the overall decrease in accuracy for the variable
Imp1 <- importance(rfImp1, scale = FALSE)
summary(Ionosphere$Class)/nrow(Ionosphere)
classRatio1 <- summary(Ionosphere$Class)/nrow(Ionosphere)
#       bad      good
# 0.3589744 0.6410256
# Caret calculates a simple mean
varImp(rfImp1, scale = FALSE)["V3",] # 0.04542253
Imp1["V3", "bad"] * 0.5 + Imp1["V3", "good"] * 0.5 # 0.04542253
# importance is closer to the weighted average of class importances
Imp1["V3", ] # 0.05262225
Imp1["V3", "bad"] * classRatio1[1] + Imp1["V3", "good"] * classRatio1[2] # 0.05274091
### Equal sample size ###
Ionosphere2 <- Ionosphere[c(which(Ionosphere$Class == "good"), sample(which(Ionosphere$Class == "bad"), 225, replace = TRUE)),]
summary(Ionosphere2$Class)/nrow(Ionosphere2)
classRatio2 <- summary(Ionosphere2$Class)/nrow(Ionosphere2)
# bad good
# 0.5 0.5
rfImp2 <- randomForest(Class ~ ., data = Ionosphere2[,3:35], ntree = 1000, importance = TRUE)
Imp2 <- importance(rfImp2, scale = FALSE)
# Caret calculates a simple mean
varImp(rfImp2, scale = FALSE)["V3",] # 0.06126641
Imp2["V3", "bad"] * 0.5 + Imp2["V3", "good"] * 0.5 # 0.06126641
# As does the average adjusted for the balanced class ratio
Imp2["V3", "bad"] * classRatio2[1] + Imp2["V3", "good"] * classRatio2[2] # 0.06126641
# There is now not much difference between the measure for balanced classes
Imp2["V3",] # 0.06106229
I think this can be interpreted as caret putting equal weight on all classes, while importance reports variables as more important if they are important for the more common class. I tend to agree with Max Kuhn, but the difference should be explained somewhere in the documentation.
https://www.r-bloggers.com/variable-importance-plot-and-variable-selection/
In the given link it is shown that when you do not specify importance = TRUE in the model, you get the same mean decrease Gini value with the randomForest and caret packages.
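A minimal sketch of that check (iris is just an illustrative dataset; with the default importance = FALSE only the Gini-based measure is stored, so both functions should fall back to it):
library(caret)
library(randomForest)
# importance = TRUE is deliberately NOT set, so only MeanDecreaseGini is stored
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
importance(rf)  # a single MeanDecreaseGini column
varImp(rf)      # caret falls back to the same MeanDecreaseGini values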