How to use knn classification (class package) with training and test datasets

Question

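Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".

x_train_auto (the training set) has 20,000 rows and x_test_auto (the test set) has 12,561 rows.

My class variable (training set) has 15,124 "<=50K" cases and 4,876 ">50K" cases.

Here is my code: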

predictions = knn(train = x_train_auto, # response
                  test  = x_test_auto, # response
                  cl = Df_census$Income[in_train_census], # prediction
                  k = 25)

table(predictions)
#<=50K 
#12561

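As you can see, all 12,561 test samples are predicted to have an Income of "<=50K".

This doesn't make sense. I'm not sure where I went wrong.

P.S.: I one-hot encoded Sex as 0 for male and 1 for female. I scaled Educ_years and Age, and added Sex to the data frame. Then I added the one-hot encoded Sex variable back into the scaled test and training data.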

Answer 1

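Identifying the problem

The x_test_auto.csv data you provided suggests that you passed logical vectors of TRUE and FALSE (which define the indices of the training and test samples, not the actual data) to the train and test arguments of class::knn.

Solution

Instead, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all the predictors you want. These are then your training and test sets.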

p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]

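In the knn function, pass the training set to the train argument and the test set to the test argument, and pass the outcome/target variable of the training set (as a factor) to cl.

The output (see ?class::knn) will be the predicted outcomes for the test set.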
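
As a minimal sketch of that call pattern, the following uses the built-in iris data instead of the census data. The split and the names idx, X_train, X_test and y_train are illustrative assumptions, not part of the original workflow.

```r
library(class)

set.seed(1)                                   # reproducible split
idx <- sample(nrow(iris), 100)                # row indices of the training set

X_train <- iris[idx, 1:4]                     # predictors only (numeric columns)
X_test  <- iris[-idx, 1:4]
y_train <- iris$Species[idx]                  # class labels of the training rows (factor)

# train and test hold predictor values; cl holds the training labels
pred <- knn(train = X_train, test = X_test, cl = y_train, k = 5)

length(pred) == nrow(X_test)                  # one prediction per test row
table(pred = pred, ref = iris$Species[-idx])  # confusion matrix on the test set
```

The point to notice is that train and test contain the predictor columns themselves, while the row selection (here idx) is only used to build them.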

Here is a complete and reproducible workflow using your data.

Data

library(class)

# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)

table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE  TRUE 
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame':    32561 obs. of  15 variables:
#>  $ Age          : int  39 50 38 53 28 37 49 52 31 42 ...
#>  $ Work         : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#>  $ Fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#>  $ Education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#>  $ EducYears    : int  13 13 9 7 13 14 5 9 14 13 ...
#>  $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#>  $ Occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#>  $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#>  $ Race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#>  $ Sex          : int  1 1 1 1 0 0 0 1 0 1 ...
#>  $ CapitalGain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
#>  $ CapitalLoss  : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ HoursPerWeek : int  40 13 40 40 40 40 16 45 50 40 ...
#>  $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#>  $ Income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"

# create data partition
in_train_census <- x_train_auto$x

Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]

# check
dim(Df_train)
#> [1] 20000    15

dim(Df_test)
#> [1] 12561    15

table(Df_train$Income)
#> 
#> <=50K  >50K 
#> 15124  4876

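Using class::knn

How well the knn (k-nearest neighbours) algorithm performs depends on the choice of the hyperparameter k. It is often hard to know in advance which value of k works best for classifying a particular dataset. In a machine learning setting, you want to try different values of k to find the one that gives the highest performance on the test dataset (i.e., data that were not used for model fitting).

It is always important to strike a good balance between overfitting (the model is too complex: it gives good results on the training data but less accurate or even nonsensical results on new test data) and underfitting (the model is too simple to capture the actual patterns in the data). As explained here, in the case of knn, a larger value of k is likely to protect better against overfitting.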

# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))

for (k in 1:30) {
  
  #cat("k =", k, "\n")
  
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[,p],
                     test = Df_test[,p],
                     cl = Df_train[,y],
                     k = k)
  
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[,y])
  
  # accuracy
  a <- sum(diag(t)) / sum(t)
  
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}

Visualizing the model evaluation

# find best k
r[which.max(r$accuracy),]
#>     k  accuracy
#> 17 17 0.8007324

(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17

# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)

Created on 2021-09-23 by the reprex package (v2.0.1)

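Interpretation

The loop results show that the best k for this particular training and test set is between 12 and 17 (see the plot above), but the gain in accuracy compared with k = 1 is very small (accuracy is around 80% regardless of k).

Additional thoughts

Given that high income is rarer than low income, accuracy may not be the ideal performance metric. Sensitivity may be equally or more important, and you can modify the example code to compute and evaluate other performance metrics.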
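
As a rough sketch of how such metrics could be derived from a confusion matrix, the example below uses a handful of made-up labels (ref and pred are illustrative, not census results) and treats ">50K" as the positive class, which is an assumption about what you care about most.

```r
# toy reference labels and predictions (made up for illustration)
ref  <- factor(c(">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"),
               levels = c("<=50K", ">50K"))
pred <- factor(c(">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"),
               levels = c("<=50K", ">50K"))

# rows are predictions, columns are the reference classes
cm <- table(pred = pred, ref = ref)

positive <- ">50K"    # the rarer class is treated as "positive" (assumption)
negative <- "<=50K"

sensitivity <- cm[positive, positive] / sum(cm[, positive])  # TP / (TP + FN)
specificity <- cm[negative, negative] / sum(cm[, negative])  # TN / (TN + FP)
accuracy    <- sum(diag(cm)) / sum(cm)

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

The same formulas can be applied to the confusion matrix t inside the loop, storing sensitivity next to accuracy for each k.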

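Beyond pure prediction, you may also want to explore whether other variables improve the model, by adding them to the p vector and comparing the resulting accuracies.

Here, we base our conclusions on one particular realization of the training and test data. Better machine learning practice is to split your data in 2 (as done here) and then use e.g. (repeated) k-fold cross-validation on the training part to choose k. Good R packages for this are e.g. caret or tidymodels.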
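
To make the idea concrete without pulling in caret or tidymodels, here is a hand-rolled k-fold cross-validation sketch in base R plus class. It runs on the built-in iris data, which stands in for your training set; n_folds, fold, ks and cv_result are names chosen for this illustration only.

```r
library(class)

set.seed(1)
X <- iris[, 1:4]          # stand-in for the training predictors
y <- iris$Species         # stand-in for the training labels

n_folds <- 5
# assign every row to one of n_folds folds at random
fold <- sample(rep(seq_len(n_folds), length.out = nrow(X)))

ks <- 1:15
cv_acc <- sapply(ks, function(k) {
  # accuracy on each held-out fold, fitting on the remaining folds
  acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold == f
    pred <- knn(train = X[!held_out, ], test = X[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(acc)               # average accuracy over the folds
})

cv_result <- data.frame(k = ks, accuracy = cv_acc)
cv_result[which.max(cv_result$accuracy), ]   # k with the best cross-validated accuracy
```

With this approach the test set is left untouched until k has been chosen, so the final test accuracy is a fairer estimate of performance on new data.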

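To get a better idea of which variables are the best predictors of income class, I would also run a logistic regression on various uncorrelated predictors.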
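
A minimal sketch of such a model, fitted with glm on simulated data so that it runs on its own. The data frame d and the coefficients used to generate it are arbitrary assumptions; with the real data you would use Df_train instead.

```r
set.seed(1)
n <- 500

# simulated predictors with the same names as in the census data
d <- data.frame(
  Age       = round(runif(n, 18, 70)),
  EducYears = sample(5:16, n, replace = TRUE),
  Sex       = rbinom(n, 1, 0.5)
)

# simulated outcome; the coefficients below are arbitrary
lin <- -8 + 0.05 * d$Age + 0.4 * d$EducYears + 0.5 * d$Sex
d$Income <- factor(ifelse(runif(n) < plogis(lin), ">50K", "<=50K"),
                   levels = c("<=50K", ">50K"))

# with a factor response, glm models the probability of the second level (">50K")
fit <- glm(Income ~ Age + EducYears + Sex, data = d, family = binomial)
summary(fit)$coefficients   # estimates, standard errors, z values and p values
```

The coefficient table gives a first indication of which predictors are associated with the outcome, which can guide what to add to the p vector for knn.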
