Caret - creating stratified data sets based on several variables
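In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable: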
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
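In the code above, the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work, but I don't know whether it is correct: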
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)
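There is a better way: combine the variables into a single stratum indicator and sample within each stratum.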
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor  = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5),
                yourFactor3 = rbinom(n, 1, .7))
Stratum indicator:
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
Sample selection:
indices <- tapply(seq_len(nrow(d)), d$group, sample, 30)
Obtain the subsample:
subsampd <- d[unlist(indices, use.names = FALSE), ]
This draws a random stratified sample of size 30 from each combination of yourFactor and yourFactor2.
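The same stratum-indicator idea can be fed back into caret to get a proportional train/test split rather than a fixed 30 rows per stratum. The sketch below assumes caret's documented signature createDataPartition(y, times, p, list, groups), under which the second positional argument in the question's call would be read as 'times' rather than as a second stratification variable. The simulated data frame and the name 'strata' are illustrative assumptions, not part of the original posts.

```r
library(caret)

set.seed(1)
n <- 1000

# Simulated data, for illustration only
df <- data.frame(
  yourFactor  = factor(sample(1:5, n, replace = TRUE)),
  yourFactor2 = factor(rbinom(n, 1, 0.5))
)

# One factor level per observed combination of the two variables
strata <- interaction(df$yourFactor, df$yourFactor2, drop = TRUE)

# Stratify the 2/3 split on the combined factor
inTrain <- createDataPartition(strata, p = 2/3, list = FALSE)
dfTrain <- df[inTrain, ]
dfTest  <- df[-inTrain, ]

# The same combined factor can be passed to createFolds() for cross-validation
folds <- createFolds(strata, k = 5)

# Compare stratum proportions in the two sets
trainProp <- prop.table(table(strata[inTrain]))
testProp  <- prop.table(table(strata[-inTrain]))
print(round(cbind(train = trainProp, test = testProp), 3))
```

If some combinations are very rare, the combined factor can have levels with only one or two rows, so it is worth inspecting table(strata) before splitting.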
This is fairly simple if you use the tidyverse.
For example:
library(dplyr)
df <- df %>%
  mutate(n = row_number()) %>% # create a row number if you don't have one
  select(n, everything())      # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>%     # any number of variables you wish to partition by proportionally
  sample_frac(.7) %>%          # '.7' is the proportion of each group you wish to sample
  ungroup()                    # drop the grouping so 'train' behaves like a plain data frame
test <- anti_join(df, train, by = "n") # test set: the observations not in 'train', matched on row number
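To check that a grouped split like the one above really is proportional, the share of each stratum that ended up in the training set can be tabulated. This is a minimal sketch on simulated data: the columns 'var1' and 'var2', the identifier 'row_id', and the object 'check' are hypothetical names chosen for illustration, and slice_sample(prop = ) is used as the current dplyr counterpart of sample_frac().

```r
library(dplyr)

set.seed(1)

# Simulated data with a hypothetical row identifier
df <- tibble(
  var1 = sample(c("a", "b", "c"), 900, replace = TRUE),
  var2 = sample(c("x", "y"), 900, replace = TRUE)
) %>%
  mutate(row_id = row_number())

# Sample 70% of the rows within every var1 x var2 combination
train <- df %>%
  group_by(var1, var2) %>%
  slice_sample(prop = 0.7) %>%
  ungroup()

# Everything not drawn into 'train' becomes the test set
test <- anti_join(df, train, by = "row_id")

# Share of each stratum that landed in the training set
check <- count(df, var1, var2, name = "total") %>%
  left_join(count(train, var1, var2, name = "in_train"),
            by = c("var1", "var2")) %>%
  mutate(share = in_train / total)

print(check)
```

Each value in 'share' should sit close to 0.7; small deviations come from rounding the group sizes down to whole rows.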