Error in calculating confusion matrix or contingency table for multiclass classification using ranger

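I am using ranger to model a multiclass classification problem on a large mixed data frame (some of the categorical variables have more than 53 levels). Training and testing run without any problems. However, building the confusion matrix/contingency table fails.

I use the iris data to illustrate the difficulty, treating Species as the categorical response variable: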

library(ranger)
library(caret)

# Data
idx = sample(nrow(iris),100)
data = iris

# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)

# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

I run into the following difficulties:

table(Test_Set$Species, probabilitiesSpecies$predictions)

Error in table(Test_Set$Species, probabilitiesSpecies$predictions) : 
all arguments must have the same length

caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.

However, the binary classification shown below works:

idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))

Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)

# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))

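How can I solve this and get a confusion matrix for the multiclass case? I have also posted it as a separate thread (Error while computing confusion matrix for multiclassification using ranger).

The ranger documentation states the following for probability = TRUE: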

With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).

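That is, when it is set to TRUE you get probability estimates, which you can then turn into classes according to your own threshold. If it is set to FALSE, however, I do not know what the default decision rule is.

In any case, your approach should be as follows: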
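
If you would rather keep probability = TRUE, the probabilities have to be converted to class labels before tabulating. The two errors in the question appear to come from that missing step: the predictions of a probability forest form a matrix with one column per class (so its length differs from that of Test_Set$Species), and max.col(...) - 1 yields the integers 0, 1, 2, which do not match the level names. In the binary example the levels happen to be "0" and "1", which is probably why it works there. Below is a minimal base R sketch, assuming the prediction matrix carries the class names as column names; the matrix `probs` and the labels `truth` are made-up stand-ins, not output from the model above.

```r
# Stand-in for predict(Species.ranger, data = Test_Set)$predictions from a
# probability forest: one row per sample, one named column per class.
# The numbers are invented for illustration only.
probs <- matrix(
  c(0.90, 0.08, 0.02,
    0.10, 0.70, 0.20,
    0.05, 0.35, 0.60,
    0.00, 0.45, 0.55),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Hypothetical true labels for the same four samples
truth <- factor(
  c("setosa", "versicolor", "virginica", "versicolor"),
  levels = colnames(probs)
)

# Take the most probable class in each row and give the result
# the same factor levels as the reference labels
predicted <- factor(
  colnames(probs)[max.col(probs, ties.method = "first")],
  levels = levels(truth)
)

# Predictions in rows, reference in columns
conf_table <- table(Predicted = predicted, Reference = truth)
print(conf_table)
```

With the objects from the question, `probs` would be the `$predictions` matrix and `truth` would be Test_Set$Species; the resulting table (or the two factors directly) can then be passed to caret::confusionMatrix().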

Species.ranger <- ranger(
        Species ~ .,
        data = Train_Set,
        importance ="impurity",
        save.memory = TRUE, 
        probability = FALSE
)

You can then evaluate the performance with confusionMatrix as follows:

probabilitiesSpecies <- predict(
        Species.ranger,
        data = Test_Set,
        verbose = TRUE
        )

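# Predictions go in the rows, the reference labels in the columns.
# Calling confusionMatrix() directly avoids needing %>% from magrittr.
confusionMatrix(
        table(
                probabilitiesSpecies$predictions,
                Test_Set$Species
        )
)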

Output

Confusion Matrix and Statistics

            
             setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         16         1
  virginica       0          0        16

Overall Statistics
                                          
               Accuracy : 0.98            
                 95% CI : (0.8935, 0.9995)
    No Information Rate : 0.34            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.97            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            1.0000           0.9412
Specificity                   1.00            0.9706           1.0000
Pos Pred Value                1.00            0.9412           1.0000
Neg Pred Value                1.00            1.0000           0.9706
Prevalence                    0.34            0.3200           0.3400
Detection Rate                0.34            0.3200           0.3200
Detection Prevalence          0.34            0.3400           0.3200
Balanced Accuracy             1.00            0.9853           0.9706
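
As a sanity check, the headline numbers can be recomputed by hand from the table. The sketch below copies the counts from the output above into a base R matrix (the name `cm` is just for illustration) and derives accuracy, per-class sensitivity and positive predictive value from it.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the confusion matrix above:
# predictions in rows, reference in columns
cm <- matrix(
  c(17,  0,  0,
     0, 16,  1,
     0,  0, 16),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified samples over all samples
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the reference (column) totals
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value per class: correct predictions over the prediction (row) totals
ppv <- diag(cm) / rowSums(cm)

print(accuracy)
print(round(sensitivity, 4))
print(round(ppv, 4))
```

The results agree with the Accuracy, Sensitivity and Pos Pred Value rows reported by confusionMatrix().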