Creating data partitions over a selected range of data to be fed into the caret::train function for cross-validation

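I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like those that caret::groupKFold() produces). The catch is that I want to restrict the test points to, say, day 16 onward, while using the remainder of the data as the training set.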

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

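The reason I want to do this is that I am only really interested in how well the model predicts the upper end of the range, since that is the region of interest. I suspect there is a way to do this with the caret::groupKFold() function, but I am not sure how. Any help would be greatly appreciated.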

Example of what each CV fold would contain:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

But in the format that the caret::groupKFold function outputs, so that the folds can be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

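I find that the built-in functions for custom folds are usually not flexible enough, so I usually build them with tidyverse. One approach to your problem would be: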

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element

Example with two rows per time point:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

With an unequal number of rows per time point:

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

You can now define these held-out folds in trainControl with the indexOut argument.
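
As a rough sketch of how the held-out folds might be wired into caret, the code below reuses the last example data frame. As far as I know, trainControl expects indexOut to come with a matching index list of training rows, so the training rows are taken here as every row not held out in that fold. The names index, index_out and ctrl, the "Time" prefix on the fold names, and the choice of a linear model are illustrative assumptions, not part of the approach above.

```r
library(tidyverse)
library(caret)

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

# held-out row numbers: one integer vector per Time value above the cutoff
index_out <- df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x$id)
names(index_out) <- paste0("Time", names(index_out)) # label folds by Time value

# training rows for each fold: every row that is not held out in that fold
index <- map(index_out, ~ setdiff(seq_len(nrow(df)), .x))

ctrl <- trainControl(method = "cv", index = index, indexOut = index_out)

# method = "lm" is only a placeholder model for illustration
fit <- train(Effect ~ Time, data = df, method = "lm", trControl = ctrl)
fit$resample # one row of performance metrics per held-out Time value
```

Because every held-out row in a fold shares the same Time value, some metrics (Rsquared in particular) may come back as NA with a warning; RMSE and MAE should still be usable.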

Edit: to get output in a shape similar to what caret::groupKFold returns (a plain list of row-number vectors), you can do:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
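
One thing to keep in mind: as far as I am aware, caret::groupKFold returns the training rows for each fold, whereas the list above holds the held-out rows. Below is a base-R sketch that produces both, applied to the original 20-row data frame from the question. The helper name holdout_folds and its cutoff argument are hypothetical, introduced only for illustration.

```r
# hypothetical helper: one fold per distinct value of `time` above `cutoff`
holdout_folds <- function(time, cutoff) {
  keep <- which(time > cutoff)
  test <- unname(split(keep, time[keep]))   # held-out rows per fold
  # training rows are the complement of the held-out rows
  train <- lapply(test, function(out) setdiff(seq_along(time), out))
  names(train) <- names(test) <- paste0("Fold", seq_along(test))
  list(index = train, indexOut = test)
}

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

folds <- holdout_folds(df$Time, cutoff = 15)

# compare the first fold with TrainSet1 / TestSet1 from the question
identical(df[folds$index$Fold1, ], subset(df, Time != 16))
identical(df[folds$indexOut$Fold1, ], subset(df, Time == 16))
```

The two list elements are named after the trainControl arguments they are meant for, so they could be passed along as index = folds$index and indexOut = folds$indexOut.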