SVM 性能与 AUC 分数不一致
SVM performance not consistent with AUC score
我有一个包含患者信息的数据集。它包括几个变量及其临床状态(如果健康则为 0,如果生病则为 1)。
我已尝试实施 SVM 模型以根据这些变量预测患者状态。
library(e1071)
Index <-
order(Ytrain, decreasing = FALSE)
SVMfit_Var <-
svm(Xtrain[Index, ], Ytrain[Index],
type = "C-classification", gamma = 0.005, probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <-
predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <-
attr(preds1, "probabilities")[,1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples],Ytest[samples])
AUC<-performance(pred,"auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
为了测试模型的性能,我计算了 ROC AUC,并通过验证集获得了 AUC = 0.997。
但是当我查看预测时,所有患者都被指定为健康。
AUC = 0.997
> xtab
prediction
Ytest 0 1
0 72 0
1 52 0
谁能帮我解决这个问题?
您是否查看了概率与拟合值? You can read about how probability works with SVM here.
如果您想查看性能,可以使用库 DescTools
和函数 Conf
或库 caret
和函数 confusionMatrix
。 (它们提供相同的输出。)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
您的问题不可重现,因此我使用 iris
数据进行了分析。每次观察的概率都相同。我包括了这个,所以你可以用另一个数据集看到这个。
library(e1071)
library(ROCR)
library(caret)
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70, # 70%
replace = F)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
type = "C-classification",
gamma = .005, probability = T,
cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good
我有一个包含患者信息的数据集。它包括几个变量及其临床状态(如果健康则为 0,如果生病则为 1)。 我已尝试实施 SVM 模型以根据这些变量预测患者状态。
library(e1071)
Index <-
order(Ytrain, decreasing = FALSE)
SVMfit_Var <-
svm(Xtrain[Index, ], Ytrain[Index],
type = "C-classification", gamma = 0.005, probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <-
predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <-
attr(preds1, "probabilities")[,1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples],Ytest[samples])
AUC<-performance(pred,"auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
为了测试模型的性能,我计算了 ROC AUC,并通过验证集获得了 AUC = 0.997。 但是当我查看预测时,所有患者都被指定为健康。
AUC = 0.997
> xtab
prediction
Ytest 0 1
0 72 0
1 52 0
谁能帮我解决这个问题?
您是否查看了概率与拟合值? You can read about how probability works with SVM here.
如果您想查看性能,可以使用库 DescTools
和函数 Conf
或库 caret
和函数 confusionMatrix
。 (它们提供相同的输出。)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
您的问题不可重现,因此我使用 iris
数据进行了分析。每次观察的概率都相同。我包括了这个,所以你可以用另一个数据集看到这个。
library(e1071)
library(ROCR)
library(caret)
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70, # 70%
replace = F)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
type = "C-classification",
gamma = .005, probability = T,
cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good