R caret: leave subject out cross validation with data subset for training?
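I would like to perform leave-subject-out cross-validation with R caret (cf. this example), but use only a subset of the training data to build the CV models. The left-out CV partition should still be used as a whole, because I need to test on all data of the left-out subject (even if it amounts to millions of samples that cannot be used for training due to computational restrictions).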
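To achieve this I created a minimal 2-class classification example using the subset parameter of caret::train and the index parameter of caret::trainControl. From what I observe this should do the trick, but I have a hard time making sure that the evaluation is still performed in a leave-subject-out way. Maybe someone with experience in this task can shed some light on this: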
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
# with diamonds we want to predict cut and look at results for different colors = subjects
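d <- as.data.frame(diamonds) # diamonds is a tibble in current ggplot2; a plain data.frame makes d[,3] return a vector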
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV indexes for trainControl
index <- llply(unique(d[,3]), function(cls) c(which(d[,3]!=cls)))
names(index) <- paste0('sub_', unique(d[,3]))
str(index) # indexes used for training models with CV = OK
m3 <- train(x = d[,indexes_data],
y = d[,indexes_labels],
method = 'glm',
metric = 'ROC',
subset = sample(nrow(d), 5000), # does this subset the data used for training and obtaining models, but not the left out partition used for estimating CV performance?
trControl = trainControl(returnResamp = 'final',
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary,
index = index))
str(m3$resample) # all samples used once = OK
# performance over all subjects
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# performance for individual subjects
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
pred_sub <- m3$pred[m3$pred$Resample==cls,]
myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
plot(myRoc, main = cls)
})
Thanks for your time!
Using the index and indexOut parameters of caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
str(diamonds)
# with diamonds we want to predict cut and look at results for different colors = subjects
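d <- as.data.frame(diamonds) # diamonds is a tibble in current ggplot2; a plain data.frame makes d[,3] return a vector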
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[,indexes_populationIndependence_subjects]), function(cls) c(which(d[,indexes_populationIndependence_subjects]!=cls)))
names(index) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))
m3 <- train(x = d[,indexes_data],
y = d[,indexes_labels],
method = 'glm',
metric = 'ROC',
trControl = trainControl(returnResamp = 'final',
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary,
index = index,
indexOut = indexOut))
m3$resample # seems OK
str(m3$pred) # seems OK
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
pred_sub <- m3$pred[m3$pred$Resample==cls,]
myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
plot(myRoc, main = cls)
} )
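I'm not sure, though, whether this really estimates performance in a population-independent way, so if anyone knows the details, please share your thoughts!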
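One way to gain some confidence is to check the partitions directly: for every fold, no subject should appear in both the training rows and the held-out rows, and the held-out rows should cover the left-out subject completely. The following is only a minimal base-R sketch on made-up data (the `subject` vector and its group sizes are hypothetical stand-ins for the color column of diamonds); it builds `index` and `indexOut` in the same way as the code above and then tests both properties.

```r
# Toy data: three hypothetical subjects with different numbers of rows.
set.seed(1)
subject <- rep(c("A", "B", "C"), times = c(50, 30, 20))
subjects <- unique(subject)

# Training rows per fold: everything except the held-out subject,
# then subsampled (20 rows here; the code above samples 1000).
index <- lapply(subjects, function(s) sample(which(subject != s), 20))
names(index) <- paste0("sub_", subjects)

# Evaluation rows per fold: all rows of the held-out subject.
indexOut <- lapply(subjects, function(s) which(subject == s))
names(indexOut) <- names(index)

# Check 1: number of subjects present on both sides of each fold.
leaks <- mapply(function(tr, te) length(intersect(subject[tr], subject[te])),
                index, indexOut)

# Check 2: each held-out partition is exactly one subject, in full.
complete <- mapply(function(te, s) setequal(te, which(subject == s)),
                   indexOut, subjects)

print(leaks)
print(complete)
```

If every entry of `leaks` is zero and every entry of `complete` is TRUE, the partitions themselves are subject-independent. On the fitted model, a similar check should be possible by comparing, for each value of `m3$pred$Resample`, the subjects of the rows listed in `m3$pred$rowIndex` against the subject that fold was meant to hold out; this only confirms which rows were evaluated, not how caret handles them internally.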