How does the createDataPartition function from the caret package split data?

From the documentation:

For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups.

For createDataPartition, the number of percentiles is set via the groups argument.
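To make the numeric case concrete, here is a rough base-R sketch of the idea the documentation describes: bin y by percentiles, then sample within each bin. It is not caret's actual implementation; the made-up data, the `round()` rule for the per-bin sample size, and all variable names are assumptions for illustration only.

```r
set.seed(1)
y <- rexp(1000)   # skewed numeric outcome (made-up data)
groups <- 5       # plays the role of the `groups` argument
p <- 0.1          # fraction of rows to sample

# Cut y into bins at evenly spaced percentiles
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample the same fraction of row indices inside every bin
idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
  i[sample.int(length(i), size = round(p * length(i)))]
}))

table(bins[idx])  # each percentile bin contributes equally to the sample
```

Because every bin contributes the same share of rows, the sampled y covers the low, middle and high ranges in about the same proportions as the full data.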

I don't understand why this "balance" is needed. I think I only understand it superficially, so any additional insight would be really helpful.

It means that if you have a dataset ds with 10000 rows

set.seed(42)
ds <- data.frame(values = runif(10000))

with 2 "classes" that are unevenly distributed (9000 vs. 1000)

ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
#    1    2 
# 9000 1000 

you can create a sample that tries to preserve the ratio ("balance") of the factor classes:

library(caret)  # provides createDataPartition
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
#   1   2 
# 900 100
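
As for why this matters: with a plain random sample, a small subset can miss the rare class entirely. Here is a minimal base-R sketch comparing the two approaches (no caret needed; the sample size of 10, the 1000 repetitions and the variable names are arbitrary choices for illustration, and the stratified part only mimics the idea rather than reproducing caret's code):

```r
set.seed(42)
cls <- factor(c(rep(1, 9000), rep(2, 1000)))

# Simple random sampling: draw 10 rows, count the class-2 rows, repeat 1000 times
n_minority <- replicate(1000, sum(cls[sample(length(cls), 10)] == 2))
mean(n_minority == 0)  # share of samples that contain no class-2 row at all

# Stratified sampling: draw 0.1% of the rows within each class
strat <- unlist(lapply(split(seq_along(cls), cls), function(i) {
  i[sample.int(length(i), size = round(0.001 * length(i)))]
}))
table(cls[strat])      # 9 rows of class 1 and 1 row of class 2
```

The chance that a simple random sample of 10 rows contains no class-2 row is about 0.9^10 ≈ 0.35, so roughly a third of such samples would give a model nothing to train or test on for that class. Sampling within each class rules this out and keeps the 9:1 ratio in every split.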