How to interpret a confusion matrix in R

Question

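I am working with a confusion matrix and have only a very basic understanding of the output. Because I am new to both confusion matrices and R, the detailed explanations I find often make it sound more complicated than it probably is. I have the output below and would like to know whether someone could explain the following to me: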

What are TP, TN, FP, and FN in the matrix?
What does Kappa represent?
What is the difference between accuracy and kappa?

> confusionMatrix(predRF, loanTest2$grade)

Confusion Matrix and Statistics

          Reference
Prediction     A    B    C    D    E    F    G
 A          2298  174   63   29   26   12    3
 B           264 3245  301   65   16    3    3
 C             5  193 2958  399   61   15    4
 D             1    1   39 1074  236   33    6
 E             0    0    2   32  249   97   30
 F             0    0    0    0    8   21   11
 G             0    0    0    0    0    0    0

Overall Statistics

           Accuracy : 0.822          
             95% CI : (0.815, 0.8288)
No Information Rate : 0.3017         
P-Value [Acc > NIR] : < 2.2e-16      

               Kappa: 0.7635         

                     Class: A Class: B Class: C Class: D Class: E Class: F Class: G
Sensitivity            0.8949   0.8981   0.8796  0.67167  0.41779 0.116022 0.000000
Specificity            0.9674   0.9220   0.9214  0.96955  0.98585 0.998389 1.000000
Pos Pred Value         0.8821   0.8327   0.8138  0.77266  0.60732 0.525000      NaN
Neg Pred Value         0.9712   0.9545   0.9515  0.95041  0.97000 0.986596 0.995241
Prevalence             0.2144   0.3017   0.2808  0.13351  0.04976 0.015112 0.004759
Detection Rate         0.1919   0.2709   0.2470  0.08967  0.02079 0.001753 0.000000
Detection Prevalence   0.2175   0.3254   0.3035  0.11606  0.03423 0.003340 0.000000
Balanced Accuracy      0.9311   0.9101   0.9005  0.82061  0.70182 0.557206 0.500000

Answer 1

Suppose this is your confusion matrix:

tab = structure(list(A = c(2298L, 264L, 5L, 1L, 0L, 0L, 0L), B = c(174L, 
3245L, 193L, 1L, 0L, 0L, 0L), C = c(63L, 301L, 2958L, 39L, 2L, 
0L, 0L), D = c(29L, 65L, 399L, 1074L, 32L, 0L, 0L), E = c(26L, 
16L, 61L, 236L, 249L, 8L, 0L), F = c(12L, 3L, 15L, 33L, 97L, 
21L, 0L), G = c(3L, 3L, 4L, 6L, 30L, 11L, 0L)), class = "data.frame", row.names = c("A", 
"B", "C", "D", "E", "F", "G"))

What are TP, TN, FP, and FN in the matrix?

These terms are defined per label. For class A, for example, they only make sense with respect to predicting A versus not A.

A_confusion_matrix = cbind(c(tab[1,1],sum(tab[-1,1])),c(sum(tab[1,-1]),sum(tab[2:7,2:7])))

     [,1] [,2]
[1,] 2298  307
[2,]  270 9102

The calculation above simply lumps together all predictions and references that are not A.

The resulting numbers represent:

A_confusion_matrix[1,1] is the number predicted as A that are truly A -> TP

A_confusion_matrix[1,2] is the number predicted as A that are not A -> FP

A_confusion_matrix[2,1] is the number not predicted as A that are truly A -> FN

A_confusion_matrix[2,2] is the number not predicted as A that are not A -> TN

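From here you can, for example, calculate the sensitivity for A, which is TP/(TP+FN) = 2298/(2298+270) = 0.8948598. This matches the Sensitivity reported for Class: A (0.8949).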
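
The same one-vs-rest collapse can be applied to every class at once. The following is a minimal sketch, assuming the table is stored as a plain numeric matrix with predictions in rows and references in columns; the names `m`, `tp`, `fp`, `fn`, and `tn` are illustrative and not part of caret.

```r
# Confusion matrix from the question: rows = predictions, columns = reference
lab <- c("A", "B", "C", "D", "E", "F", "G")
m <- matrix(c(2298,  174,   63,   29,  26, 12,  3,
               264, 3245,  301,   65,  16,  3,  3,
                 5,  193, 2958,  399,  61, 15,  4,
                 1,    1,   39, 1074, 236, 33,  6,
                 0,    0,    2,   32, 249, 97, 30,
                 0,    0,    0,    0,   8, 21, 11,
                 0,    0,    0,    0,   0,  0,  0),
            nrow = 7, byrow = TRUE,
            dimnames = list(Prediction = lab, Reference = lab))

tp <- diag(m)                 # predicted k and truly k
fp <- rowSums(m) - tp         # predicted k but truly another class
fn <- colSums(m) - tp         # truly k but predicted as another class
tn <- sum(m) - tp - fp - fn   # everything that involves neither

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
print(round(rbind(sensitivity, specificity), 4))
```

The first column of the printed result should agree with the Sensitivity and Specificity rows for Class: A in the output from the question.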

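What does Kappa represent?

It is Cohen's kappa, a metric that measures how good your predictions are compared with random guessing or random assignment.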
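
To make that concrete, here is a rough sketch that recomputes accuracy and kappa by hand from the table in the question, using the standard unweighted definition kappa = (p_o - p_e) / (1 - p_e). The variable names (`m`, `p_o`, `p_e`, `kappa_hat`) are my own, and the code assumes predictions are in rows and references in columns.

```r
# Confusion matrix from the question: rows = predictions, columns = reference
lab <- c("A", "B", "C", "D", "E", "F", "G")
m <- matrix(c(2298,  174,   63,   29,  26, 12,  3,
               264, 3245,  301,   65,  16,  3,  3,
                 5,  193, 2958,  399,  61, 15,  4,
                 1,    1,   39, 1074, 236, 33,  6,
                 0,    0,    2,   32, 249, 97, 30,
                 0,    0,    0,    0,   8, 21, 11,
                 0,    0,    0,    0,   0,  0,  0),
            nrow = 7, byrow = TRUE,
            dimnames = list(Prediction = lab, Reference = lab))

n   <- sum(m)
p_o <- sum(diag(m)) / n                     # observed agreement (the accuracy)
p_e <- sum(rowSums(m) * colSums(m)) / n^2   # agreement expected by chance
kappa_hat <- (p_o - p_e) / (1 - p_e)

print(round(c(accuracy = p_o, kappa = kappa_hat), 4))
```

The two values should agree with the Accuracy (0.822) and Kappa (0.7635) lines that confusionMatrix() printed.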

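What is the difference between accuracy and kappa?

The difference becomes large when your dataset is imbalanced. Kappa compares the observed accuracy with the accuracy expected by chance: (observed - expected) / (1 - expected). For example, if 90% of the labels belong to one class and the model predicts that class for everything, you get 90% accuracy. With Cohen's kappa, however, the expected agreement is already 0.9, so you have to do better than that to get a good score.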
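
The imbalanced case can be checked with a small made-up example. This is only an illustration: the 100 cases, the labels "pos" and "neg", and the variable names are all hypothetical, and kappa is computed by hand with the usual (p_o - p_e) / (1 - p_e) definition.

```r
# Hypothetical data: 90 "pos" and 10 "neg" cases; the model always predicts "pos"
cls <- c("pos", "neg")
m <- matrix(c(90, 10,
               0,  0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Prediction = cls, Reference = cls))

n   <- sum(m)
p_o <- sum(diag(m)) / n                     # accuracy: 90 of 100 correct
p_e <- sum(rowSums(m) * colSums(m)) / n^2   # chance agreement is also 0.9
kappa_hat <- (p_o - p_e) / (1 - p_e)

print(c(accuracy = p_o, kappa = kappa_hat))
```

Accuracy looks respectable here, while kappa shows that the model adds nothing beyond always guessing the majority class.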
