R caret: How do I apply separate pca to different dataframes before training?

I am using caret in R. My ultimate goal is to submit different data frames to separate preProcess pca and then put the PCA components together in one training with ridge regression. However, see the example code below, where I do not get the same results when applying pca within the preProcess argument of train versus outside/before the train function.

  1. Why do I not get the same results?
  2. What is the best way to achieve my main goal?

#Sample data
s <- c(-0.412440717220306, -0.459911376237869, -0.234769582748413, -0.332282930612564, -0.486973077058792, -0.301480442285538, -0.181094691157341, -0.240918189287186, 0.0962697193026543, -0.119731709361076, -0.389783203601837, -0.217093095183372, -0.302948802709579, -0.406619131565094, 0.247409552335739, -0.406119048595428, 0.0574243739247322, -0.301231145858765, -0.229316398501396, -0.0620433799922466)
t <- c(0.20061232149601, 0.0536709427833557, 0.530373573303223, 0.523406386375427, 0.267315864562988, 0.413556098937988, 0.274257719516754,  0.275401413440704, 0.634453296661377, 0.145272701978683, 0.196711808443069, 0.332845687866211, 0.345706522464752, 0.444085538387299, 0.253269702196121, 0.231440827250481, -0.196317762136459, 0.49691703915596, 0.43754768371582, 0.0106721892952919)
u <- c(-0.565160751342773, 0.377725303173065,-0.273447960615158, -0.338064402341843, -0.59904420375824, -0.780133605003357,-0.508388638496399, -0.226167500019073, -0.257708549499512, -0.349863946437836,-0.443032741546631, -0.36387038230896, -0.455201774835587, -0.137616977095604,0.130770832300186, -0.420618057250977, -0.125859051942825, -0.382272869348526, -0.355217516422272, -0.0601325333118439)
v <- c(-0.45850995182991, -0.0105021595954895, -0.475157409906387, -0.325350821018219, -0.548444092273712, -0.562069535255432, -0.473256289958954, -0.492668628692627, -0.205974608659744, -0.266964733600616, -0.289298176765442, -0.615423858165741, -0.261823982000351, -0.472221553325653,  -0.684594392776489, -0.42777806520462, -0.240604877471924, -0.589631199836731, -0.782602787017822, -0.468854814767838)
w <- c(-0.886135756969452, -0.96577262878418,-0.755464434623718, -0.640497982501984, -0.849709093570709, -0.837802410125732, -0.659287571907043, -0.646972358226776, 0.0532735884189606, -0.646163880825043,-0.963890254497528, -0.91286826133728, -1.10484659671783, -0.596551716327667,  -0.371927708387375, -0.684276521205902, -0.55376398563385, -0.969008028507233, -0.956810772418976, -0.0229262933135033)

y <- c(9, 26, 30, 15, 25, 30, 30, 35, 35, 30, 21, 30,  9, 33, 31, 34, 29, 35, 25, 31)

#Sample data for procedure 1 and 2
df_test1 <- data.frame(s, t, u, v, w)
df_test2 <- df_test1

#PROCEDURE 1: preProcess (pca) applied WITHIN "train" function
library(caret)    
ytrain_df_test <- c(1:nrow(df_test1)) # number of observations to be split into folds
ntrain <- length(ytrain_df_test)    
# define folds
cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE) #, ...)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final') #, ...)
#adding y
df_test1$y <- y
# train the model
set.seed(1)
model1 <- caret::train(y~., data=df_test1, trControl=train_control, method= 'ridge', preProcess = 'pca')
output1 <- list(model1, model1$pred, summary(model1$pred), cor.test(model1$pred$pred, model1$pred$obs))
names(output1) <- c("Model", "Model_pred", "Summary", "Correlation")
output1


#PROCEDURE 2: preProcess (pca) applied OUTSIDE/BEFORE "train" function
ytrain_df_test <- c(1:nrow(df_test2)) # number of observations to be split into folds
ntrain <- length(ytrain_df_test)    

df2 <- preProcess(df_test2, method="pca", thresh = 0.95)
df_test2 <- predict(df2, df_test2)
df_test2$y <- y
df_test2
# define folds
cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE) 
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final') 
# train the model
set.seed(1)
model2 <- caret::train(y~., data=df_test2, trControl=train_control, method= 'ridge') #, preProcess = 'pca')
model2
output2 <- list(model2, model2$pred, summary(model2$pred), cor.test(model2$pred$pred, model2$pred$obs))
names(output2) <- c("Model", "Model_pred", "Summary", "Correlation")
output2

1. When you perform preProcess (pca) within the train function:

  • pca is run on each train set during CV and each train set is transformed
  • several ridge regression models are estimated (based on the defined hyper parameter search) on each of these transformed train sets
  • based on the pca obtained for each train set, the appropriate test set is transformed
  • all of the fitted models are evaluated on the appropriate transformed test sets

When this is finished, the final model is built with the hyper parameters which had the best average performance on the test sets:

  • pca is applied to the whole train set and the transformed train data is obtained
  • using the pre-chosen hyper parameters, a ridge regression model is built on the transformed train data

When you perform preProcess (pca) before the train function, you cause data leakage, since you are using information from your CV test folds to estimate the pca coordinates. This causes optimistic bias during CV and should be avoided.
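
For intuition, here is a minimal sketch of what train() effectively does for a single CV fold, reusing the objects defined in the question code (fold 1 is just an example; the object names are illustrative):

fold_train <- df_test1[cv_folds[[1]], ]  # training part of fold 1 (returnTrain = TRUE)
fold_test  <- df_test1[-cv_folds[[1]], ] # held-out part of fold 1

# the PCA rotation is estimated on the training part only ...
pp <- preProcess(fold_train[, c("s", "t", "u", "v", "w")], method = "pca", thresh = 0.95)
train_pcs <- predict(pp, fold_train[, c("s", "t", "u", "v", "w")])

# ... and the held-out part is merely transformed with those same loadings,
# so no information from the test fold leaks into the rotation
test_pcs <- predict(pp, fold_test[, c("s", "t", "u", "v", "w")])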

2. I am not aware of a built-in caret functionality that would provide this kind of juggling with several data sets. I believe this can be achieved conveniently with mlr3pipelines, especially with the help of this tutorial.

Here is an example of how to split the iris data set into two data sets, apply scaling and pca to each of them, combine the transformed columns, and fit an rpart model. The number of retained PCA components as well as one rpart hyper parameter are tuned using random search:

Packages:

library(mlr3pipelines)
library(visNetwork)
library(mlr3learners)
library(mlr3tuning)
library(mlr3)  
library(paradox)

Define a pipeop selector named "slct1":

pos1 <- po("select", id = "slct1")

Tell it which columns to select:

pos1$param_set$values$selector <- selector_name(c("Sepal.Length", "Sepal.Width"))

Tell it what to do with the features it takes:

pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1") -> pr1

Define a pipeop selector named "slct2":

pos2 <- po("select", id = "slct2")

Tell it which columns to select:

pos2$param_set$values$selector <- selector_name(c("Petal.Length", "Petal.Width"))

Tell it what to do with the features it takes:

pos2 %>>%
   mlr_pipeops$get("scale", id = "scale2") %>>%
   mlr_pipeops$get("pca", id = "pca2") -> pr2

Combine the two outputs:

piper <- gunion(list(pr1, pr2)) %>>%
  mlr_pipeops$get("featureunion")

And pipe them into a learner:

graph <- piper %>>%
  mlr_pipeops$get("learner",
                  learner = mlr_learners$get("classif.rpart"))

Let's check how it looks:

graph$plot(html = TRUE)

Now define how it should be tuned:

glrn <- GraphLearner$new(graph)

With 10-fold CV:

cv10 <- rsmp("cv", folds = 10)

Tune the number of retained PCA dimensions for each data set as well as the complexity parameter of rpart:

ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.",  lower = 1, upper = 2),
  ParamInt$new("pca2.rank.",  lower = 1, upper = 2)
))

Define the task and the tuning:

task <- mlr_tasks$get("iris")

instance <- TuningInstance$new(
  task = task,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)

Initiate the random search:

tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result

Perhaps this can also be done with tidymodels; however, I have not tried it yet.
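
For reference, here is an untested sketch of how the same grouped-PCA idea might look with recipes/parsnip/workflows: step_pca is applied twice with different variable selections, its prefix argument keeps the two sets of components apart, and mixture = 0 with the glmnet engine gives ridge regression (the penalty value is just a placeholder, not tuned):

library(tidymodels)  # loads recipes, parsnip, workflows, ...

rec <- recipe(y ~ ., data = df_test1) %>%
  step_normalize(all_predictors()) %>%
  # a separate pca per variable group
  step_pca(s, t, num_comp = 2, prefix = "grp1_PC") %>%
  step_pca(u, v, w, num_comp = 2, prefix = "grp2_PC")

ridge <- linear_reg(penalty = 0.1, mixture = 0) %>%  # mixture = 0 -> ridge
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(ridge)

fit(wf, data = df_test1)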

EDIT: to answer the questions in the comments.

To fully grasp mlr3, I advise you to read the book as well as the tutorials of each of the accessory packages.

In the above example, the number of PCA dimensions retained for each data set is tuned jointly with the cp hyper parameter. This is defined in these lines:

ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.",  lower = 1, upper = 2),
  ParamInt$new("pca2.rank.",  lower = 1, upper = 2)
)) 

So for pca1 the algorithm could choose to retain 1 or 2 dimensions (I set it like that since there are only two features in each data set).

If you do not wish to tune the number of dimensions to optimize performance, then you can define the pipeop like this:

pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1

In that case you should omit it from the parameter set:

ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1)
))

As far as I know, the explained variance cannot currently be tuned; only the number of retained dimensions of the pca transformation can.
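
If in doubt, you can print the parameter set of the pca pipeop to see what is exposed for tuning; as far as I recall it only surfaces the prcomp arguments:

library(mlr3pipelines)
po("pca")$param_set  # center, scale. and rank.; no variance-explained threshold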

To change the predict type, you can define the learner:

learner <- mlr_pipeops$get("learner",
                            learner = mlr_learners$get("classif.rpart"))

and set the predict type:

learner$learner$predict_type <- "prob"

and then create the graph:

graph <- piper %>>%
  learner

To get the performance for each hyper parameter combination:

instance$archive(unnest = "params")

To get the predictions for each hyper parameter combination:

lapply(as.list(instance$archive(unnest = "params")[,"resample_result"])$resample_result,
       function(x) x$predictions())

To get the predictions of the best hyper parameter combination:

instance$best()$predictions()

And if you would like it in the form of a data frame:

do.call(rbind,
        lapply(instance$best()$predictions(),
               function(x) data.frame(x$data$tab,
                                      x$data$prob)))

There are probably some helper functions that make this easier; I just have not played around enough.