使用 choose(k,n) 将数据拆分为 5 个子集，而不使用 sample()

Question

我想拆分 train 和 test 但使用 choose() 函数 not with sample() in R.

我的数据集（一个 csv 文件）有 58 行和 28 列，我想在这个数据集上做一个 10 倍或 5 倍的 CV。

我要如何为这个任务写下代码？

我试过：

set.seed(1)
smp_size=choose(58,5, name_dataset) # which is totally wrong but ... 
# I haven't figured out yet how to take 5 subsets from 58 observations
# each time I do a 5/10 -fold  CV

train_ind=sample(seq_len(nrow(name_dataset)),size=smp_size) # I think sample here is wrong too
train=name_dataset[train_ind,]
test=name_dataset[-train_ind,]

Answer 1

我不知道你所说的 5 子集的所有可能组合是什么意思。这似乎是一个令人难以置信的大量可能性。我假设您的意思是您想要包含数据集中所有样本的 5 个数据集的子集。我可能会做这样的事情。我们首先制作一个组向量，它是 k 的数量和数据集的长度。然后我们随机抽取这些组并按这些分组拆分数据集。

library(tidyverse)

set.seed(3465)
test_data <- tibble(A = runif(58),
                    B = runif(58))


k_split <- function(dat,k, seed = 1){
  set.seed(seed)
  grp <- rep(1:k, length.out = nrow(dat))
  dat |>
    mutate(grp = sample(grp, nrow(dat), replace = F)) |>
    group_split(grp)|>
    map(\(d) select(d, -grp))
}

k_split(test_data, 5)
#> [[1]]
#> # A tibble: 12 x 2
#>        A      B
#>    <dbl>  <dbl>
#>  1 0.476 0.468 
#>  2 0.636 0.639 
#>  3 0.334 0.0269
#>  4 0.668 0.220 
#>  5 0.398 0.919 
#>  6 0.343 0.748 
#>  7 0.799 0.526 
#>  8 0.710 0.759 
#>  9 0.737 0.927 
#> 10 0.819 0.441 
#> 11 0.852 0.656 
#> 12 0.416 0.541 
#> 
#> [[2]]
#> # A tibble: 12 x 2
#>         A      B
#>     <dbl>  <dbl>
#>  1 0.0107 0.905 
#>  2 0.109  0.539 
#>  3 0.715  0.778 
#>  4 0.523  0.416 
#>  5 0.609  0.357 
#>  6 0.152  0.0972
#>  7 0.919  0.450 
#>  8 0.866  0.510 
#>  9 0.0347 0.0890
#> 10 0.862  0.465 
#> 11 0.364  0.765 
#> 12 0.789  0.601 
#> 
#> [[3]]
#> # A tibble: 12 x 2
#>         A      B
#>     <dbl>  <dbl>
#>  1 0.580  0.228 
#>  2 0.201  0.0418
#>  3 0.0359 0.417 
#>  4 0.521  0.758 
#>  5 0.534  0.974 
#>  6 0.580  0.563 
#>  7 0.844  0.781 
#>  8 0.756  0.271 
#>  9 0.211  0.533 
#> 10 0.851  0.764 
#> 11 0.885  0.150 
#> 12 0.262  0.371 
#> 
#> [[4]]
#> # A tibble: 11 x 2
#>         A     B
#>     <dbl> <dbl>
#>  1 0.556  0.313
#>  2 0.353  0.821
#>  3 0.0959 0.861
#>  4 0.759  0.261
#>  5 0.207  0.772
#>  6 0.668  0.527
#>  7 0.150  0.788
#>  8 0.0939 0.257
#>  9 0.0913 0.817
#> 10 0.294  0.790
#> 11 0.0224 0.253
#> 
#> [[5]]
#> # A tibble: 11 x 2
#>          A      B
#>      <dbl>  <dbl>
#>  1 0.0893  0.665 
#>  2 0.966   0.142 
#>  3 0.672   0.0849
#>  4 0.641   0.155 
#>  5 0.490   0.187 
#>  6 0.00394 0.295 
#>  7 0.126   0.813 
#>  8 0.202   0.474 
#>  9 0.0740  0.107 
#> 10 0.412   0.709 
#> 11 0.509   0.253

使用 choose(k,n) 将数据拆分为 5 个子集，而不使用 sample()

Split data in 5 subsets with choose(k,n) & NOT with sample()

random

test-data

r

training-data

cross-validation