Controlling sampling for cross-validation in the caret R package
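I have the following problem. In a data set from N subjects, I have several samples per subject. I want to train a model on the data set, but I would like to make sure that in each resampling, the training set contains no other samples from the subject being held out.
Alternatively, I would block the cross-validation by subject. Is that possible?
Without the caret package, I would do something like this (mock code):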
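subjects <- paste0("X", 1:10)
samples <- rep(subjects, each = 5)
x <- matrix(runif(50 * 10), nrow = 50)
loocv <- function(x, samples) {
  for (i in seq_len(nrow(x))) {
    # hold out row i; drop = FALSE keeps the single row as a matrix
    test <- x[i, , drop = FALSE]
    # train on rows from every subject except the one that owns row i
    train <- x[samples != samples[i], , drop = FALSE]
    # create the model from train and predict for test
  }
}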
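Or, leaving out one whole subject at a time:
looSubjCV <- function(x, samples, subjects) {
  for (i in seq_along(subjects)) {
    # all rows of subject i form the test set
    test <- x[samples == subjects[i], , drop = FALSE]
    # all rows of the remaining subjects form the training set
    train <- x[samples != subjects[i], , drop = FALSE]
    # create the model from train and predict for test
  }
}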
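Otherwise, the other samples from the same subject end up in the training set, which will result in overfitting of the model.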
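To make the mock code above concrete, here is a minimal base R sketch of leave-one-subject-out cross-validation. The simulated data frame `dat`, its columns `p1` to `p3` and `y`, the subject offsets, and the choice of `lm` are all assumptions made purely for illustration; they are not part of the original question.

```r
set.seed(1)
subjects <- paste0("X", 1:10)
samples  <- rep(subjects, each = 5)

# Hypothetical data: three predictors and a response that carries a
# subject-specific offset (illustrative only)
dat <- data.frame(p1 = runif(50), p2 = runif(50), p3 = runif(50))
subj_effect <- setNames(rnorm(length(subjects)), subjects)
dat$y <- dat$p1 + unname(subj_effect[samples]) + rnorm(50, sd = 0.1)

# Leave one subject out at a time and predict all of its rows
loso_pred <- numeric(nrow(dat))
for (s in subjects) {
  held_out <- samples == s
  fit <- lm(y ~ p1 + p2 + p3, data = dat[!held_out, ])
  loso_pred[held_out] <- predict(fit, newdata = dat[held_out, ])
}

# Out-of-subject error estimate
loso_rmse <- sqrt(mean((dat$y - loso_pred)^2))
loso_rmse
```

Because every row of a subject is held out together, no prediction is made by a model that has already seen that subject.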
Not directly, but you can definitely do it using the index and indexOut arguments to trainControl. Here is an example using 10-fold CV:
library(caret)
library(nlme)
data(Orthodont)
head(Orthodont)
subjects <- as.character(unique(Orthodont$Subject))
## figure out folds at the subject level
set.seed(134)
sub_folds <- createFolds(y = subjects, list = TRUE, returnTrain = TRUE)
## now create the mappings to which *rows* are in the training set
## based on which subjects are left in or out
in_train <- holdout <- vector(mode = "list", length = length(sub_folds))
row_index <- 1:nrow(Orthodont)
for (i in seq_along(sub_folds)) {
  ## Which subjects are in the training set for fold i
  sub_in <- subjects[sub_folds[[i]]]
  ## which rows of the data correspond to those subjects
  in_train[[i]] <- row_index[Orthodont$Subject %in% sub_in]
  holdout[[i]] <- row_index[!(Orthodont$Subject %in% sub_in)]
}
names(in_train) <- names(holdout) <- names(sub_folds)
ctrl <- trainControl(method = "cv",
                     savePredictions = TRUE,
                     index = in_train,
                     indexOut = holdout)
mod <- train(distance ~ (age + Sex)^2, data = Orthodont,
             method = "lm",
             trControl = ctrl)
first_fold <- subset(mod$pred, Resample == "Fold01")
## These were used to fit the model
table(Orthodont$Subject[-first_fold$rowIndex])
## These were held out:
table(Orthodont$Subject[first_fold$rowIndex])
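As a possibly shorter route, more recent versions of caret export `groupKFold()`, which builds subject-level training indices in one call. The sketch below assumes your installed caret provides it; the object names `grp_folds`, `ctrl_grp`, `mod_grp` and `overlap` are introduced here only for illustration. When `indexOut` is not supplied, the rows not listed in `index` are used as the holdout set.

```r
library(caret)
library(nlme)   # provides the Orthodont data
data(Orthodont)

set.seed(134)
## Training-row indices per fold; all rows of a subject stay together.
## The number of folds actually returned is worth checking with length().
grp_folds <- groupKFold(Orthodont$Subject, k = 10)

ctrl_grp <- trainControl(method = "cv",
                         savePredictions = TRUE,
                         index = grp_folds)
mod_grp <- train(distance ~ (age + Sex)^2, data = Orthodont,
                 method = "lm",
                 trControl = ctrl_grp)

## Sanity check: no subject appears on both sides of any split
subj <- as.character(Orthodont$Subject)
overlap <- vapply(grp_folds, function(idx) {
  length(intersect(subj[idx], subj[-idx]))
}, integer(1))
overlap
```

Every entry of `overlap` should be zero if the folds are properly blocked by subject.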