Create n data sets from one data set without repetition using stratified sampling

Question

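I have a data set train with 500 rows. I want a data frame with n columns, each holding 500/n row indices, with no row index repeated in any other column. The indices should come from stratified sampling on one column of train, say train$y.

I tried the following, but it returns duplicate values: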

library(caret)
n <- 10 # I want to divide my data set into 10 parts
data_partition <- createDataPartition(y = train$y, times = 10, 
                                 p = 1/n, list = FALSE)

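To summarize with an example: suppose I have a data set train with 100 rows, one column of which is train$y (values 0 or 1). I want to get 10 data sets from train, each with 10 rows. They should be stratified on train$y, and no row of one data set should appear in any of the other 9.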

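Example input (columns ID, x and y) and expected output (the first 4 columns; the set tables to their right only show each set in detail):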

ID  x   y   sample      set 1           set 2           set 3   
1   1   0   set 2       ID  x   y       ID  x   y       ID  x   y
2   2   0   set 3       8   4   1       11  2   1       17  9   1
3   3   1   set 3       9   3   1       12  3   0       5   2   1
4   1   1   set 3       10  1   1       13  4   1       6   4   1
5   2   1   set 3       18  3   0       1   1   0       7   4   0
6   4   1   set 3       19  7   0       14  5   1       2   2   0
7   4   0   set 3       20  8   1       15  6   1       3   3   1
8   4   1   set 1                       16  10  1       4   1   1
9   3   1   set 1                                               
10  1   1   set 1                                               
11  2   1   set 2                                               
12  3   0   set 2                                               
13  4   1   set 2                                               
14  5   1   set 2                                               
15  6   1   set 2                                               
16  10  1   set 2                                               
17  9   1   set 3                                               
18  3   0   set 1                                               
19  7   0   set 1                                               
20  8   1   set 1

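In the example above, the given input is ID, x and y. I want to obtain the sample column, which I can then use at any time to split the data into the 3 tables on the right.

Note that y in the data has 14 1s and 6 0s, a ratio of 70:30, and each output set has roughly the same ratio.

Sample data set in a copy/run-friendly format: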

data <- structure(list(ID = 1:20, x = c(1L, 2L, 3L, 1L, 2L, 4L, 4L, 4L, 
3L, 1L, 2L, 3L, 4L, 5L, 6L, 10L, 9L, 3L, 7L, 8L), y = c(0L, 0L, 
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 
0L, 1L)), .Names = c("ID", "x", "y"), class = "data.frame", row.names = c(NA, 
-20L))

Answer 1

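This can be done with the caret package: createFolds assigns every row to exactly one of k folds, so no row is repeated across sets. Try the code below: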

# Creating the dataset
data <- structure(list(ID = 1:20, x = c(1L, 2L, 3L, 1L, 2L, 4L, 4L, 4L, 
3L, 1L, 2L, 3L, 4L, 5L, 6L, 10L, 9L, 3L, 7L, 8L), y = c(0L, 0L, 
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 
0L, 1L)), .Names = c("ID", "x", "y"), class = "data.frame", row.names = c(NA, -20L))
# Solution
library(caret)
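# list = FALSE returns one fold number per row; factor() makes caret stratify
# on the classes of y (a numeric y is binned by percentiles instead)
k <- createFolds(factor(data$y), k = 3, list = FALSE)
# Cross-tabulate fold against y to check the class balance of each fold
addmargins(table(k, data$y))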
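
If caret is not available, the same kind of non-overlapping stratified split can be sketched in base R. This is only an illustration under stated assumptions: stratified_sets is a hypothetical helper (not a caret function), the seed is arbitrary, and the data frame is the sample data from the question rewritten with data.frame().

```r
# Sample data from the question (columns ID, x, y)
data <- data.frame(
  ID = 1:20,
  x  = c(1, 2, 3, 1, 2, 4, 4, 4, 3, 1, 2, 3, 4, 5, 6, 10, 9, 3, 7, 8),
  y  = c(0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1)
)

# Hypothetical helper: give every row one set number in 1..n by shuffling
# the rows of each class of y and dealing them out to the sets in turn.
stratified_sets <- function(y, n) {
  set_id <- integer(length(y))
  offset <- 0L
  for (rows in split(seq_along(y), y)) {
    # Shuffle by index so a class with a single row is handled correctly
    shuffled <- rows[sample.int(length(rows))]
    # Continue dealing where the previous class stopped, which keeps the
    # total set sizes within one row of each other
    set_id[shuffled] <- (offset + seq_along(shuffled) - 1L) %% n + 1L
    offset <- offset + length(rows)
  }
  set_id
}

set.seed(42)  # arbitrary seed, only for reproducibility
data$sample <- paste("set", stratified_sets(data$y, n = 3))

# Each row carries exactly one label, so the sets cannot share rows
sets <- split(data, data$sample)

# Class counts per set, the same check as in the answer
addmargins(table(data$sample, data$y))
```

With the caret solution above, the equivalent step would be data$sample <- paste("set", k) followed by split(data, data$sample), since k already holds one fold number per row.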

Tags: r, random-sample