Splitting data 80/20 per class

Question

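When splitting my data into 80/20 train/validation sets, I want the samples to be evenly distributed across the two sets. I don't want a purely random split, because that could leave some classes over- or under-represented in one set and introduce bias. Specifically, I want to make sure that for each class label, 80% of the samples end up in the training set.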

With that in mind, I was thinking of doing this with the caret package in R, for example:

data_split <- createDataPartition(y = data$column, p = 0.8, list = FALSE) # row indices for the training set
training <- data[data_split, ] # training data
testing <- data[-data_split, ] # testing or validation data

For example, I have 64 classes and am thinking of doing a random split of the data within each class.

Is this correct?

Answer 1

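If I understand correctly what you want, you are doing it right. The createDataPartition function is meant for exactly this case. It performs a simple split based on the outcome; as the vignette puts it: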

the random sampling occurs within each class and should preserve the overall class distribution of the data
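To make "sampling within each class" concrete, here is a minimal base R sketch of a manual per-class 80/20 split. It is only an illustration of the idea, not caret's actual implementation; the toy data frame and the names toy, label, rows_by_class and train_idx are made up for this example, and rounding the per-class count up is an assumption of the sketch.

```r
# Illustrative manual stratified split in base R (not caret's code).
set.seed(42)

# Hypothetical toy data: 1000 rows, labels drawn from 64 classes
toy <- data.frame(
  value = runif(1000),
  label = sample(as.character(1:64), 1000, replace = TRUE)
)

# Row indices grouped by class label
rows_by_class <- split(seq_len(nrow(toy)), toy$label)

# Within each class, draw 80% of its rows (rounded up) for training.
# sample.int() on the positions avoids sample()'s special case for
# a vector of length one.
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  n_train <- ceiling(0.8 * length(rows))
  rows[sample.int(length(rows), n_train)]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Share of each class that ended up in the training set
train_share <- table(toy_train$label) / table(toy$label)
range(train_share)
```

Because the draw happens separately inside each class, every class contributes roughly 80% of its rows to the training set, which a single sample() over all rows would not guarantee.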

We can check whether this holds with a simple plot:

library(caret)
library(ggplot2)
set.seed(5)
# toy data: two numeric columns and a label column with 64 classes
df <- data.frame(a = runif(1000), b = runif(1000) * 10,
                 c = sample(as.character(1:64), 1000, replace = TRUE))
str(df)
# split the data 80/20 into train/test
ind <- createDataPartition(df$c, p = 0.8, list = FALSE)
train <- df[ind, ]
test <- df[-ind, ]
# frequencies of each class for the whole dataset
x <- table(df$c) / length(df$c)
# for the training set
x_train <- table(train$c) / length(train$c)
# for the testing set
x_test <- table(test$c) / length(test$c)

# one row per class; assumes every class appears in both train and test
freq <- data.frame(class = names(x), df = as.numeric(x),
                   train = as.numeric(x_train), test = as.numeric(x_test))

# red: whole dataset, green: training set, blue: testing set
ggplot(freq, aes(x = class)) +
  geom_line(aes(y = df, group = 1), col = "red") +
  geom_line(aes(y = train, group = 1), col = "green") +
  geom_line(aes(y = test, group = 1), col = "blue") +
  ylab("frequencies")

As you can see, the class distribution is preserved in both the training and testing sets.
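A numeric check can complement the plot. The sketch below rebuilds the same toy data and computes, for each class, the share of its rows that landed in the training set. The names classes, n_total, n_train and train_share are introduced here for illustration only. The shares should sit at or near 0.8, with small deviations because each class can only contribute a whole number of rows.

```r
library(caret)

set.seed(5)
df <- data.frame(
  a = runif(1000),
  b = runif(1000) * 10,
  c = sample(as.character(1:64), 1000, replace = TRUE)
)

ind <- createDataPartition(df$c, p = 0.8, list = FALSE)
train <- df[ind, ]
test  <- df[-ind, ]

# Fix the set of labels so both tables share the same order,
# even if a class were missing from one of the subsets
classes <- sort(unique(df$c))
n_total <- table(factor(df$c, levels = classes))
n_train <- table(factor(train$c, levels = classes))

# Per-class share of rows assigned to the training set
train_share <- as.numeric(n_train) / as.numeric(n_total)
summary(train_share)
```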

r

machine-learning

r-caret