How to set a PPV in caret for random forest in R?
So I am interested in creating a model that optimizes PPV. I created an RF model (below) that outputs a confusion matrix, from which I then manually calculate sensitivity, specificity, PPV, NPV, and F1. I know that accuracy is what is being optimized right now, but I am willing to give up some sensitivity and specificity in exchange for a higher PPV.
data_ctrl_null <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               savePredictions = TRUE, sampling = NULL)

set.seed(5368)

model_htn_df <- train(outcome ~ ., data = htn_df, ntree = 1000,
                      tuneGrid = data.frame(mtry = 38),
                      trControl = data_ctrl_null, method = "rf",
                      preProc = c("center", "scale"), metric = "ROC",
                      importance = TRUE)

model_htn_df$finalModel  # provides the confusion matrix
Result:
Call:
randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 38
OOB estimate of error rate: 16.2%
Confusion matrix:
     no  yes  class.error
no  274   19   0.06484642
yes  45   57   0.44117647
My manual calculations: sens = 55.9%, spec = 93.5%, PPV = 75.0%, NPV = 85.9%. (The confusion matrix has my "no" and "yes" outcomes switched relative to what I consider the positive class, so I also switched the numbers when calculating the performance metrics.)
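As a sanity check, these four metrics can be recomputed directly from the counts in the OOB confusion matrix above, treating "yes" as the positive class (a base-R sketch):

```r
# counts taken from the printed OOB confusion matrix (rows = actual, cols = predicted)
TP <- 57    # actual yes, predicted yes
FN <- 45    # actual yes, predicted no
FP <- 19    # actual no,  predicted yes
TN <- 274   # actual no,  predicted no

sens <- TP / (TP + FN)   # sensitivity: 57/102  = 0.559
spec <- TN / (TN + FP)   # specificity: 274/293 = 0.935
ppv  <- TP / (TP + FP)   # PPV:         57/76   = 0.750
npv  <- TN / (TN + FN)   # NPV:         274/319 = 0.859
round(c(sens = sens, spec = spec, ppv = ppv, npv = npv), 3)
```

This reproduces the manually calculated values, confirming that "yes" was used as the positive class.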
So what do I need to do to get a PPV of 90%?
This is a similar question, but I didn't really understand it.
We define a function that calculates PPV and returns the result with a name:
# custom summaryFunction: caret calls this with a data frame of
# observed/predicted values, the class levels, and the model name
PPV <- function(data, lev = NULL, model = NULL) {
  value <- posPredValue(data$pred, data$obs, positive = lev[1])
  c(PPV = value)
}
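To see what this summary function is computing, here is a toy check of the PPV formula on a hand-made prediction vector. The vectors are hypothetical; the calculation below uses only base R, but it matches what caret's posPredValue(pred, obs, positive = "versi") returns for the same inputs:

```r
# hypothetical predictions and ground truth, with "versi" as the positive class
pred <- factor(c("versi", "versi", "versi", "others"),
               levels = c("versi", "others"))
obs  <- factor(c("versi", "others", "versi", "others"),
               levels = c("versi", "others"))

# PPV = TP / (TP + FP): of everything predicted positive, the fraction that is correct
tp <- sum(pred == "versi" & obs == "versi")   # true positives:  2
fp <- sum(pred == "versi" & obs == "others")  # false positives: 1
tp / (tp + fp)                                # 2/3 = 0.667
```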
Suppose we have the following data:
library(randomForest)
library(caret)

data <- iris
# collapse iris to a two-class problem: "versi" vs everything else
data$Species <- ifelse(data$Species == "versicolor", "versi", "others")
trn <- sample(nrow(iris), 100)
Then we train with PPV specified as the metric:
mdl <- train(Species ~ ., data = data[trn, ],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV,
                                      classProbs = TRUE))
Random Forest
100 samples
4 predictor
2 classes: 'others', 'versi'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
  mtry  PPV
  2     0.9682811
  3     0.9681759
  4     0.9648426
PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Now you can see that it was tuned on PPV. But you cannot force the training to reach a PPV of 0.9. It really depends on the data: if your independent variables have no predictive power, the PPV will not improve no matter how long you train, right?