How can I use LOOCV in R with KNN?

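I am trying to use KNN with cancer data. At first I simply split the data into a training set and a test set, but I got unexpected results. So I want to use LOOCV to check them.

I have only found LOOCV examples for generalized linear models.

For example: glm.fit = glm(mpg ~ horsepower, data=Auto)

So how can I use LOOCV with KNN in R?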

Edit

My code:

wdbc<- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",sep=",",stringsAsFactors = FALSE)

wdbc<-wdbc[-1]

normalize <- function(x) {return ((x-min(x)) / (max(x) - min(x)))}

wdbc_n <- as.data.frame(lapply(wdbc[2:31], normalize))
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]

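I loaded the data, dropped the ID column, and left out the class label (now the first column) when normalizing. Then I split the data into training and test sets. However, I want to use LOOCV instead of the split above.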

The knn.cv function from the class package performs leave-one-out cross-validation. The call below returns a LOOCV prediction for every row of the full data (i.e. no separation into train and test).

library(class)

knn.cv(train = wdbc_n, 
      cl = as.factor(wdbc[,1]), 
      k = 4, prob = FALSE,                        # test for different values of k
      use.all = TRUE)

Reference: knn.cv, R documentation
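To make explicit what leave-one-out means here, below is a minimal sketch of the same idea written by hand with knn(): each row is held out in turn and predicted from all the others. It uses the built-in iris data as a stand-in so that it runs without a download. The names x, y, pred and loocv_acc, and the choice k = 5, are illustrative assumptions and not part of the original code.

```r
library(class)

# Stand-in data (assumption): built-in iris instead of the wdbc data
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
x <- as.data.frame(lapply(iris[1:4], normalize))
y <- iris$Species

set.seed(42)                      # knn() breaks ties at random
n <- nrow(x)
pred <- character(n)

for (i in seq_len(n)) {
  # Train on every row except i, then predict the held-out row i
  fit_i <- knn(train = x[-i, ], test = x[i, ], cl = y[-i], k = 5)
  pred[i] <- as.character(fit_i)
}

loocv_acc <- mean(pred == y)      # fraction of held-out rows classified correctly
print(loocv_acc)
print(table(predicted = pred, actual = y))
```

As in the question, the features are normalized once on the full data before the loop. For wdbc you would pass wdbc_n and as.factor(wdbc[,1]) in place of x and y.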

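The general idea in knn is to find the right k value (i.e. the number of nearest neighbours) to use for prediction. This is done using cross-validation.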
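With knn.cv alone, one way to do this is to compute the LOOCV accuracy for a range of k values and keep the best one. The sketch below again uses the built-in iris data as a stand-in. The grid 1:20 and the names ks, loocv_acc and best_k are assumptions chosen for illustration.

```r
library(class)

# Stand-in data (assumption): built-in iris instead of the wdbc data
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
x <- as.data.frame(lapply(iris[1:4], normalize))
y <- iris$Species

set.seed(1)                       # knn.cv() breaks ties at random
ks <- 1:20

# LOOCV accuracy for each candidate k
loocv_acc <- sapply(ks, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred == y)
})

best_k <- ks[which.max(loocv_acc)]  # first k that reaches the highest accuracy
print(data.frame(k = ks, accuracy = loocv_acc))
print(best_k)
```

Because which.max returns the first maximum, ties between several k values are resolved in favour of the smallest k.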

A better way would be to use the caret package to perform cross-validation over a grid of k values and pick the optimal one. Something like:

library(caret)

train.control <- trainControl(method  = "LOOCV")

fit <- train(V1~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:20),
             trControl  = train.control,
             metric     = "Accuracy",
             data       = cbind(V1 = as.factor(wdbc[,1]), wdbc_n))

Output of fit:

        k-Nearest Neighbors 

569 samples
 30 predictor
  2 classes: 'B', 'M' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 568, 568, 568, 568, 568, 568, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.9525483  0.8987965
   2  0.9595782  0.9132927
   3  0.9701230  0.9355404
   4  0.9683656  0.9318146
  ........................
  13  0.9736380  0.9429032
  14  0.9718805  0.9391558
  15  0.9753954  0.9467613
  16  0.9683656  0.9314173
  17  0.9736380  0.9429032
  18  0.9630931  0.9197531
  19  0.9648506  0.9236488
  20  0.9630931  0.9197531

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 15.

qplot(fit$results$k,fit$results$Accuracy,geom = "line",
      xlab = "k", ylab = "Accuracy")
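
As a possible variation on the caret call, the min-max normalization can be handed to train() through preProcess = "range", so that the scaling is re-estimated on the training rows of each leave-one-out split and not once on the full data. This is a sketch under assumptions: it uses the built-in iris data as a stand-in, and the grid of k values and the name fit_iris are illustrative.

```r
library(caret)

set.seed(42)
fit_iris <- train(Species ~ .,
                  data       = iris,            # stand-in data (assumption)
                  method     = "knn",
                  preProcess = "range",         # min-max scaling inside each resample
                  tuneGrid   = expand.grid(k = c(1, 3, 5, 7, 9)),
                  trControl  = trainControl(method = "LOOCV"),
                  metric     = "Accuracy")

print(fit_iris$results[, c("k", "Accuracy", "Kappa")])  # LOOCV results per k
print(fit_iris$bestTune)                                 # selected k

# The final model is refit on all rows with the selected k
head(predict(fit_iris, newdata = iris))
```

For the wdbc data, the same call would take the unnormalized feature columns together with the factor label, as in the cbind() expression used for fit.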