Creating data partitions over a selected range of data to be fed into caret::train for cross-validation
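I want to create jack-knife data partitions for the data frame below, to be used in caret::train (like the partitions that caret::groupKFold() produces). The catch is that I want to restrict the test points to the later days (day 16 onward in the example below), while using all of the remaining data as the training set.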
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)
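The reason I want to do this is that I am only really interested in how well the model predicts the upper end of the range, since that is the region of interest. I feel there is a way to do this with caret::groupKFold(), but I am not sure how. Any help would be appreciated.
Example of what each CV fold would contain: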
TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)
TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)
TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)
TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)
TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)
However, the folds should be in the format that caret::groupKFold outputs, so that they can be fed into caret::train:
CVFolds <- caret::groupKFold(df$Time)
CVFolds
Thanks in advance!
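I find that the built-in functions are usually not flexible enough for custom folds, so I usually build them with tidyverse. One approach to your problem would be: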
library(tidyverse)
df %>%
  mutate(id = row_number()) %>% # use the row number as a column called id
  filter(Time > 15) %>%         # filter Time as per your need
  split(.$Time) %>%             # split df to a list by Time
  map(~ .x %>% select(id))      # select row numbers for each list element
Example with two rows per time point:
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id)) -> test_folds
test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40
Example with an unequal number of rows per time point:
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id))
$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10
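You can now define these held-out folds in trainControl via the indexOut argument. Note that indexOut expects a list of integer row-index vectors, so each one-column data frame has to be converted to a plain vector first.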
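To make the trainControl step concrete, here is a minimal sketch using the one-row-per-day data from the question. It rests on some assumptions: the names test_idx, train_idx and ctrl are made up for illustration, a plain linear model stands in for whatever model is actually being tuned, and the training rows are passed explicitly through index alongside indexOut so that each resample trains on everything except its held-out day.

```r
library(caret)

# Toy data from the question: one row per day.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time   = 1:20)

# Held-out rows: one fold for each of days 16 to 20 (hypothetical name).
test_idx <- lapply(16:20, function(d) which(df$Time == d))

# Training rows for each fold: every row that is not held out in that fold.
train_idx <- lapply(test_idx, function(out) setdiff(seq_len(nrow(df)), out))

# Give both lists the same labels so the resamples line up by name.
names(train_idx) <- names(test_idx) <- paste0("Fold", seq_along(test_idx))

# index = rows used for fitting, indexOut = rows used for evaluation.
ctrl <- trainControl(method = "cv", index = train_idx, indexOut = test_idx)

# Illustrative model only; swap in the method you actually want to evaluate.
fit <- train(Effect ~ Time, data = df, method = "lm", trControl = ctrl)
fit$resample
```

With a single held-out row per fold, R-squared cannot be computed for a resample, so caret will likely warn about missing performance values; RMSE and MAE remain usable.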
Edit: to get output in a form similar to that of caret::groupKFold (a plain list of integer row-index vectors), you can do:
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
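If you would rather not depend on tidyverse, the same list can be built in base R. This is only a sketch under assumptions: rows, keep, test_idx and train_idx are hypothetical names, and the complementary training rows are derived in the same step because, as far as I know, caret::groupKFold returns training rows rather than held-out rows.

```r
# Unequal-rows example data from above.
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time   = c(rep(1, 5), rep(2, 3), rep(3, 2)))

rows <- seq_len(nrow(df))
keep <- df$Time > 1   # time points that are allowed to act as test sets

# Held-out rows: one integer vector per retained time point.
test_idx <- unname(split(rows[keep], df$Time[keep]))

# Matching training rows: every row outside the held-out group.
train_idx <- lapply(test_idx, function(out) setdiff(rows, out))

str(test_idx)   # rows 6-8 and rows 9-10
str(train_idx)  # rows 1-5 plus 9-10, and rows 1-8
```

Here test_idx is what goes to indexOut and train_idx is what goes to index.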