Splitting a dataset into a list and lapplying an lm model in R

Question

I am trying to apply an lm model to my dataset using the caret package.

Reproducible example:

library(caret)  # createDataPartition, train, trainControl
library(dplyr)  # select

df <- data.frame(x = 1:10000, y = sample(1:1000, 10000, replace = TRUE), group = sample(c('A', 'B', 'C'), 10000, replace = TRUE, prob = c(.1, .5, .4)))

df_list <- split(df, df$group)

df_list <- lapply(df_list, function(x) select(x, -group))

I get an error when creating the data partitions. I want to partition the data with caret's createDataPartition and then apply the train function.

train_test <- lapply(df_list, function(x) createDataPartition(x, p = .8, list = FALSE))

model_list <- lapply(train_test, function(z) train(x ~ ., z, method = 'lm', trControl = trainControl(method = 'cv', number = 10, verboseIter = TRUE), preProcess = c('nzv', 'center', 'scale')))

I think this is a simple matter of handling the list structure, but for some reason I am running into trouble. Thanks for your help!

Answer 1

createDataPartition takes a vector, not a data frame:

train_test <- lapply(df_list, function(x) createDataPartition(x$y, p = .8, list = FALSE))
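For illustration, here is a minimal sketch of what that call returns. It assumes caret is installed and uses a smaller, seeded copy of the question's data (the seed and the row count are my own choices, not part of the question). With list = FALSE, each element should be a one-column matrix of row positions into the corresponding group, not the rows themselves.

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(x = 1:1000,
                 y = sample(1:1000, 1000, replace = TRUE),
                 group = sample(c('A', 'B', 'C'), 1000, replace = TRUE,
                                prob = c(.1, .5, .4)))

# split on group and keep only the modelling columns
df_list <- split(df[, c('x', 'y')], df$group)

# pass the vector d$y, not the data frame d
train_test <- lapply(df_list, function(d) {
  createDataPartition(d$y, p = .8, list = FALSE)
})

# structure of one element: a one-column matrix of row positions
str(train_test$A)

# share of rows selected per group, roughly 0.8
train_share <- sapply(names(df_list), function(g) {
  nrow(train_test[[g]]) / nrow(df_list[[g]])
})
print(train_share)
```

The indices still have to be applied to each data frame before any model is fitted.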

Answer 2

I think the partitioning error occurs because createDataPartition expects a vector rather than a data frame. I think you can do:

train_test <- lapply(df_list, function(x) {
  x[createDataPartition(x$x, p = 0.8, list = FALSE),]
})

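Then your model_list <- ... block works for me.

As far as I can tell, this does not mess up your indices (join here comes from the plyr package):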

set.seed(123)
df_small <- data.frame(x = runif(10), y = letters[1:10])
df_small_part <- df_small[createDataPartition(df_small$x, list = FALSE),]

> join(df_small, df_small_part, type = "left", by = "y")
           x y         x
1  0.2875775 a 0.2875775
2  0.7883051 b        NA
3  0.4089769 c        NA
4  0.8830174 d 0.8830174
5  0.9404673 e 0.9404673
6  0.0455565 f 0.0455565
7  0.5281055 g        NA
8  0.8924190 h        NA
9  0.5514350 i 0.5514350
10 0.4566147 j 0.4566147

Answer 3

If you type ?createDataPartition in the console, you can see the correct usage of the function.

That is, its general form is as follows:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5,
  length(y)))

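Here y is "a vector of outcomes". It specifically needs the outcome because the training and test splits are balanced with respect to the outcome variable (which I assume is y in your case).

So, instead of the following code that you have:

train_test <- lapply(df_list, function(x) createDataPartition(x, p = .8, list = FALSE))

replace it with the following: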
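As a small sketch of that balancing, the example below partitions on a factor outcome and compares class proportions before and after. It assumes caret is installed; the seed and the standalone group vector are illustrative choices of mine, modelled on the question's group column.

```r
library(caret)

set.seed(7)  # arbitrary seed, only so the sketch is reproducible
group <- factor(sample(c('A', 'B', 'C'), 1000, replace = TRUE,
                       prob = c(.1, .5, .4)))

# for a factor outcome, sampling is done within each class
idx <- createDataPartition(group, p = .8, list = FALSE)

full_prop  <- prop.table(table(group))
train_prop <- prop.table(table(group[idx[, 1]]))

# the two sets of proportions should be close to each other
print(round(full_prop, 3))
print(round(train_prop, 3))
```

For a numeric outcome, the help page describes the same idea applied to groups formed from percentiles of y, which is what the groups argument controls.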

train_test <- lapply(df_list, function(x) { 
  return(createDataPartition(x$y, p = .8, list = FALSE))
  })

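To be clear, the only modification is adding $y.

However, this leads to another error on the last line (the line with the lapply() and train() calls). You see, createDataPartition() returns indices into your data frame. In other words, to get the training set for each df in df_list, you have to use e.g. (df_list[[1]])[train_test[[1]],]. Likewise, to get the corresponding test set, you have to use e.g. (df_list[[1]])[-train_test[[1]],] (note the minus sign). Therefore, you should rewrite the last line as follows: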

model_list <- purrr::map2(df_list, train_test, 
                          function(df, train_index)  {
                            train(x ~ ., df[train_index,], 
                                  method = 'lm', 
                                  trControl = trainControl(method = 'cv', 
                                                           number = 10, 
                                                           verboseIter = TRUE), 
                                  preProcess = c('nzv', 'center', 'scale')) 
                            })

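Note that purrr's map2 function is similar to sapply/lapply (which call a function on each element of a list). The only difference is that map2 iterates over two lists (df_list and train_test) in parallel.

Hope that helps!

Edit: If you would like to learn more about the caret package, I recommend the following link: http://topepo.github.io/caret/data-splitting.html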
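To make the index handling concrete, here is a sketch that uses base R's Map as a stand-in for map2 and also scores each model on its held-out rows. It assumes caret is installed; the smaller seeded data set and the names train_sets, test_sets and test_error are my own illustrative choices, and the preProcess step is left out for brevity.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(x = 1:600,
                 y = sample(1:1000, 600, replace = TRUE),
                 group = sample(c('A', 'B', 'C'), 600, replace = TRUE))
df_list <- split(df[, c('x', 'y')], df$group)

train_test <- lapply(df_list, function(d) {
  createDataPartition(d$y, p = .8, list = FALSE)
})

# Map walks over both lists in parallel, like purrr::map2
train_sets <- Map(function(d, idx) d[idx, ], df_list, train_test)
test_sets  <- Map(function(d, idx) d[-idx, ], df_list, train_test)  # note the minus

# one cross-validated lm per group, fitted on the training rows only
model_list <- lapply(train_sets, function(d) {
  train(x ~ ., data = d, method = 'lm',
        trControl = trainControl(method = 'cv', number = 10))
})

# held-out error per group: predictions versus the observed x
test_error <- Map(function(m, d) postResample(predict(m, newdata = d), d$x),
                  model_list, test_sets)
print(test_error)
```

Keeping the indices in their own list (rather than overwriting it with the training rows) is what makes the test sets recoverable afterwards.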

Answer 4

Here is a purrr, list-column, tidyverse-compatible, Jenny Bryan-inspired solution. Please share your feedback on how you would make it cleaner.

library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(x = 1:10000, y = sample(1:1000, 10000, replace = TRUE), 
                 group = sample(c('A', 'B', 'C'), 10000, replace = TRUE, prob = c(.1, .5, .4)))

df %>% group_by(group) %>% nest() %>% 
  mutate(dataPart = map(data, ~caret::createDataPartition(.x$x, p = .8, list = FALSE) )) %>% 
  mutate(model_list = map2(data, dataPart, ~caret::train(x ~ ., 
                                      data=.x[.y,], 
                                      method = 'lm', 
                                      trControl = caret::trainControl(method = 'cv', number = 10, verboseIter = TRUE), 
                                      preProcess = c('nzv', 'center', 'scale'))),
         oof_prediction=pmap(list(data, dataPart, model_list), ~caret::predict.train(..3, newdata=..1[-..2, ])),
         oof_error=pmap(list(data, dataPart, oof_prediction), ~caret::postResample(..3, ..1$x[-..2])),
         oof_error=map(oof_error, ~as.data.frame(t(.x)))) %>% 
  unnest(oof_error)

What happens in data.frame, stays in data.frame - Hadley Wickham

# A tibble: 3 x 7
   group                 data          dataPart  model_list oof_prediction     RMSE     Rsquared
  <fctr>               <list>            <list>      <list>         <list>    <dbl>        <dbl>
1      C <tibble [3,971 x 2]> <int [3,179 x 1]> <S3: train>    <dbl [792]> 2902.691 2.386907e-05
2      B <tibble [5,041 x 2]> <int [4,033 x 1]> <S3: train>  <dbl [1,008]> 2832.764 3.075320e-04
3      A   <tibble [988 x 2]>   <int [792 x 1]> <S3: train>    <dbl [196]> 2861.664 3.438135e-03
