Sensitivity and Specificity Calculation and Decision Matrix for the One-R Classification Model in R

I haven't found a similar question. In this question the positive result turned out not to be specified; that one is similar and was also asked by me, but it concerns a different problem (a Zero-R model on the same dataset), and I seem to have the same issue here with One-R, hopefully stated more clearly. My question is why my results differ from what I expected and whether my One Rule model is running correctly. There is a warning message that I'm not sure I need to address, but more specifically there are two conflicting confusion matrices that don't agree: the sensitivity and specificity I calculated by hand don't match the values reported by the confusionMatrix() function from the caret package. It looks like something is inverted, but I'll keep checking. Any advice is greatly appreciated!

For context, the One Rule model tests each attribute (column) of the cancer dataset to see which one produces the most accurate benign (B) versus malignant (M) predictions in the confusion matrix: for example texture, smoothness, area, or some other factor, each represented as raw data in its own column.

There is this warning; my assumption is that I could add more arguments to deal with it, but I don't fully understand them:

oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose,  :
#>   data contains unused factor levels
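My reading of the warning is that some factor column in cancersamp still carries levels that no longer occur after the 150-row sample was taken (subsetting a data frame does not drop factor levels). A minimal sketch of what I assume would silence it, using droplevels() before refitting:

# Drop factor levels that no longer occur after sampling, then refit.
# (Sketch only; whether this is needed depends on which columns are factors.)
cancersamp <- droplevels(cancersamp)
oneRModel  <- OneR(as.factor(Diagnosis) ~ ., data = cancersamp)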

Here are two separate confusion matrices, which appear to have inverted labels, and each gives different specificity and sensitivity results; one I computed by hand and the other comes from the confusionMatrix() function in the caret package:

table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#>     B  M
#>  B 28  1
#>  M  5 12
 
 #OneR(formula, data, subset, na.action,
 #     control = Weka_control(), options = NULL)
 
 
confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B") 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  B  M
#>          B 28  5
#>          M  1 12
#>                                          
#>                Accuracy : 0.8696          
#>                  95% CI : (0.7374, 0.9506)
#>     No Information Rate : 0.6304          
#>     P-Value [Acc > NIR] : 0.0003023       
#>                                          
#>                   Kappa : 0.7058          
#>                                          
#>  Mcnemar's Test P-Value : 0.2206714       
#>                                           
#>             Sensitivity : 0.9655          
#>             Specificity : 0.7059          
#>          Pos Pred Value : 0.8485          
#>          Neg Pred Value : 0.9231          
#>              Prevalence : 0.6304          
#>          Detection Rate : 0.6087          
#>    Detection Prevalence : 0.7174          
#>       Balanced Accuracy : 0.8357          
#>                                          
#>        'Positive' Class : B               
#>                                          
 
sensitivity1 = 28/(28+5)
specificity1 = 12/(12+1)
specificity1
#> [1] 0.9230769
sensitivity1
#> [1] 0.8484848

This is the pseudocode for what I assume the OneR function already does, so I shouldn't have to do it manually (a rough R sketch of the same idea follows the pseudocode):

For each attribute, 
  For each value of the attribute, make a rule as follows:
      count how often each class appears 
      find the most frequent class 
      make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
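
As a rough illustration of that pseudocode (not the OneR package's actual implementation), a minimal sketch that scores one already-discretized attribute by the majority class per value could look like this:

# Rough sketch of the 1R idea for a single attribute: assign the majority class
# to each attribute value, then score the attribute by its training error rate.
one_rule_for <- function(attr, class) {
  tab  <- table(attr, class)                        # class counts per attribute value
  rule <- colnames(tab)[apply(tab, 1, which.max)]   # majority class for each value
  names(rule) <- rownames(tab)
  error_rate <- 1 - sum(apply(tab, 1, max)) / sum(tab)
  list(rule = rule, error_rate = error_rate)
}

# Score every candidate attribute and keep the one with the smallest error rate
# (numeric columns would need to be binned first, which OneR handles internally):
# candidates <- setdiff(names(dataTrain), c("PatientID", "Diagnosis"))
# rules      <- lapply(dataTrain[candidates], one_rule_for, class = dataTrain$Diagnosis)
# best       <- names(which.min(sapply(rules, function(r) r$error_rate)))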

Here is the rest of the code for the One R model:

#--------------------------------------------------
#     One R Model
#--------------------------------------------------

set.seed(23)
randsamp <- sample(nrow(cancerdata), 150, replace=FALSE)
#randsamp

cancersamp <- cancerdata[randsamp,]
#cancersamp

#?sample.split

spl = sample.split(cancersamp$Diagnosis, SplitRatio = 0.7)
#spl

dataTrain = subset(cancersamp, spl==TRUE)
dataTest = subset(cancersamp, spl==FALSE)

oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose,  :
#>   data contains unused factor levels
summary(oneRModel)

#> Call:
#> OneR.formula(formula = as.factor(Diagnosis) ~ ., data = cancersamp)

#> Rules:
#> If perimeter = (53.2,75.7] then as.factor(Diagnosis) = B
#> If perimeter = (75.7,98.2] then as.factor(Diagnosis) = B
#> If perimeter = (98.2,121]  then as.factor(Diagnosis) = M
#> If perimeter = (121,143]   then as.factor(Diagnosis) = M
#> If perimeter = (143,166]   then as.factor(Diagnosis) = M

#> Accuracy:
#> 134 of 150 instances classified correctly (89.33%)

#> Contingency table:
#>                     perimeter
#> as.factor(Diagnosis) (53.2,75.7] (75.7,98.2] (98.2,121] (121,143] (143,166] Sum
#>                 B          * 31        * 63          1         0         0  95
#>                 M             1          14       * 19      * 18       * 3  55
#>                 Sum          32          77         20        18         3 150
#> ---
#> Maximum in each column: '*'

#> Pearson's Chi-squared test:
#> X-squared = 92.412, df = 4, p-value < 2.2e-16

dataTest.pred <- predict(oneRModel, newdata = dataTest)
table(dataTest$Diagnosis, dataTest.pred)
#>   dataTest.pred
#>      B  M
#>   B 28  1
#>   M  5 12

Here is a small snippet of the dataset. As you can see, perimeter is the single rule factor that was selected, but I expected the results to correlate with the study's findings that texture, area, and smoothness give the best predictions. I don't know all the variables surrounding that in the study, though, and these are random samples, so I can always keep testing.

head(cancerdata)
  PatientID radius texture perimeter   area smoothness compactness concavity concavePoints symmetry  fractalDimension Diagnosis
1    842302  17.99   10.38    122.80 1001.0    0.11840     0.27760    0.3001       0.14710   0.2419          0.07871         M
2    842517  20.57   17.77    132.90 1326.0    0.08474     0.07864    0.0869       0.07017   0.1812          0.05667         M
3  84300903  19.69   21.25    130.00 1203.0    0.10960     0.15990    0.1974       0.12790   0.2069          0.05999         M
4  84348301  11.42   20.38     77.58  386.1    0.14250     0.28390    0.2414       0.10520   0.2597          0.09744         M
5  84358402  20.29   14.34    135.10 1297.0    0.10030     0.13280    0.1980       0.10430   0.1809          0.05883         M
6    843786  12.45   15.70     82.57  477.1    0.12780     0.17000    0.1578       0.08089   0.2087          0.07613         M

This website provides some information, but for the OneR model it's hard to figure out which matrix to use: both have similar specificity and sensitivity calculations, and their confusion matrices have similar tables.

However, this is another problem with the confusion matrix issue, and it just cleared up which one is correct. The Zero-R matrix looked wrong because it said sensitivity is 1.00 and specificity is 0.00, while my own results put sensitivity around 0.6246334 across multiple trials, with 0.00 for specificity. But this website actually cleared it up: because the Zero-R model uses zero predictive factors, the sensitivity really is just 1.00 and the specificity 0.00. It makes a single prediction and simply goes with the majority class.
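
To convince myself, here is a tiny sketch of what Zero-R amounts to (assuming B is the majority class, as it is in my training data):

# Zero-R sketch: always predict the single most frequent class, so with
# positive = "B" every actual B is a true positive (sensitivity = 1.00) and
# every actual M is misclassified as B (specificity = 0.00).
majorityClass <- names(which.max(table(dataTrain$Diagnosis)))
zeroR.pred <- factor(rep(majorityClass, nrow(dataTest)),
                     levels = levels(as.factor(dataTest$Diagnosis)))
confusionMatrix(zeroR.pred, as.factor(dataTest$Diagnosis), positive = "B")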

Cross-applying the conclusion about which table was correct for the Zero-R model to the One-R model, the correct one is based on the same confusionMatrix() function called the same way:

> confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B") 
Confusion Matrix and Statistics

          Reference
Prediction  B  M
         B 28  5
         M  1 12

These are the correct calculations, consistent with the Zero-R model's 1.00 sensitivity and 0.00 specificity:

Sensitivity : 0.9655          
Specificity : 0.7059

For both of my questions, Zero-R and One-R, this part was done incorrectly, presumably because the arguments weren't set up correctly:

> dataTest.pred <- predict(oneRModel, newdata = dataTest)
> table(dataTest$Diagnosis, dataTest.pred)
   dataTest.pred
     B  M
  B 28  1
  M  5 12

According to https://topepo.github.io/caret/measuring-performance.html:

Sensitivity is the true positive rate (correctly predicted positives / total actual positives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 28/(28 + 1) = 0.9655

Specificity is the true negative rate (correctly predicted negatives / total actual negatives); in this case, again with "B" as the positive class: 12/(12 + 5) = 0.7059

The apparent inconsistency is because the OneR/manual confusion matrix tabulation from table() is transposed relative to the matrix produced by confusionMatrix(). Your manual calculation also appears to be incorrect, because you divided by the total true/false predictions instead of the total true/false actual values.
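
To make that concrete, the same numbers can be reproduced directly from your table() output, keeping in mind that its rows are the actual classes and its columns the predictions:

tab <- table(dataTest$Diagnosis, dataTest.pred)
sensitivity_B <- tab["B", "B"] / sum(tab["B", ])   # 28 / (28 + 1) = 0.9655
specificity_B <- tab["M", "M"] / sum(tab["M", ])   # 12 / (12 + 5) = 0.7059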