Unable to generate a confusion matrix for a one-class classification in R
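I am trying to understand and implement one-class classification in R on a dataset from Kaggle (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
When I try to print the confusion matrix, I get this error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
What am I doing wrong?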
library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)
ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv',
header = TRUE)
mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",
"smoothness_mean","compactness_mean","concavity_mean",
"concave.points_mean","symmetry_mean","fractal_dimension_mean",
"radius_se","texture_se","perimeter_se",
"area_se","smoothness_se","compactness_se",
"concavity_se","concave.points_se","symmetry_se",
"fractal_dimension_se","radius_worst","texture_worst",
"perimeter_worst","area_worst","smoothness_worst",
"compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]
dataclean <- na.omit(data)
#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]
svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
type='one-classification',
trControl = fitControl,
nu=0.10,
scale=TRUE,
kernel="radial",
metric = "ROC")
#Perform predictions
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)
confTrain <- table(Predicted=svm.predtrain,
Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
Reference=test$diagnosis[as.integer(names(svm.predtest))])
confusionMatrix(confTest,positive='TRUE')
print(confTrain)
print(confTest)
Your problem is in this line:
#Convert classification to logical
data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]
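I assume you are using R 4.0 or later, where read.csv no longer converts character columns to factors by default (stringsAsFactors now defaults to FALSE). This command: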
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
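will then convert every diagnosis to NA, because the values are the strings "M" and "B" (malignant and benign, respectively).
So make sure strings are converted to factors when you import the data: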
ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
str(ds)
'data.frame': 569 obs. of 33 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 ...
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
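The difference between the two import modes can be checked on a small vector. This is a minimal sketch in base R; the vector diagnosis_chr and its values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical labels standing in for the diagnosis column
diagnosis_chr <- c("M", "B", "B", "M")

# as.numeric() on plain strings cannot parse "M" or "B", so every value is NA
# (the coercion warning is suppressed to keep the output clean)
as_num_chr <- suppressWarnings(as.numeric(diagnosis_chr))
print(as_num_chr)

# as.numeric() on a factor returns the integer level codes instead;
# levels are sorted alphabetically, so "B" is 1 and "M" is 2
diagnosis_fct <- factor(diagnosis_chr)
as_num_fct <- as.numeric(diagnosis_fct)
print(levels(diagnosis_fct))
print(as_num_fct)

# Comparing the labels directly avoids relying on level codes at all
is_malignant <- diagnosis_chr == "M"
print(is_malignant)
```

Comparing against the label itself, as in the last step, is one way to sidestep the question of whether the positive class is code 1 or code 2, but it only works if the column has not already been passed through as.numeric().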
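I suppose it will take some people a while to get used to this new behaviour in R.
Because as.numeric() on a factor returns the integer level codes (here 1 for "B" and 2 for "M"), your command to convert the classification to logical should be: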
data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # or == 1 ?
With that change, the rest of your commands should work:
confusionMatrix(confTest, positive='TRUE')
Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE    10    8    # Note these numbers may change
    TRUE    100   50

               Accuracy : 0.3571
                 95% CI : (0.2848, 0.4346)
    No Information Rate : 0.6548
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0342

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.86207
            Specificity : 0.09091
         Pos Pred Value : 0.33333
         Neg Pred Value : 0.55556
             Prevalence : 0.34524
         Detection Rate : 0.29762
   Detection Prevalence : 0.89286
      Balanced Accuracy : 0.47649

       'Positive' Class : TRUE
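As a side note, the wording of the original error suggests that the table handed to confusionMatrix() was not square, which can happen when one of the two classes never occurs in the predictions or in the reference. A possible safeguard, sketched below in base R, is to fix the factor levels on both sides before tabulating. The vectors pred and ref are hypothetical and only stand in for the SVM predictions and the diagnosis column.

```r
# Hypothetical predictions and reference labels (illustrative values only)
pred <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
ref  <- c(FALSE, FALSE, FALSE, FALSE, FALSE)  # only one class is present

# A plain table() keeps only the values that actually occur,
# so the reference side has a single column here
tab_raw <- table(Predicted = pred, Reference = ref)
print(dim(tab_raw))

# Declaring the same levels on both sides keeps the table square,
# with zero counts for any class that does not occur
lv <- c(FALSE, TRUE)
tab_sq <- table(Predicted = factor(pred, levels = lv),
                Reference = factor(ref, levels = lv))
print(tab_sq)
```

A table built this way always has matching row and column names, so it can be inspected even when a class is missing. It does not fix the underlying labelling problem; it only makes the symptom easier to read.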