Create a vector for k-folds (ID specific) in R
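I have been trying all kinds of approaches, but they all end in error messages and strange results. I am currently using a function from SurvSL, but I would like to fine-tune it to my specific needs. This is the complete function: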
# Function to compute the k-fold cross-validated concordance index
# for Lasso-Cox, Ridge-Cox, and EN-Cox
library(survival)  # Surv(), survConcordance()
library(glmnet)    # glmnet(), cv.glmnet()

c_indexCv_combined1 = function(data, k){
  y_dat = Surv(data$obs.time, data$status)
  set.seed(1)
  # One fold label per row, assigned at random
  folds = sample(rep(1:k, length.out = nrow(data)))
  prediction_lasso = c()
  prediction_ridge = c()
  prediction_net = c()
  index = c()
  for (j in 1:k){
    # Rows in fold j are held out; all other rows are used for training
    idx = which(folds == j)
    train = data[-idx,]
    test = data[idx,]
    y_train = Surv(train$obs.time, train$status)
    y_test = Surv(test$obs.time, test$status)
    # Design matrix built from every column except the first two
    x = model.matrix(~., data[,-c(1,2)])
    fit_lasso = glmnet(x[-idx,], y_train, family="cox", alpha=1)
    cvFit_lasso = cv.glmnet(x[-idx,], y_train, family="cox", alpha=1)
    pred_lasso = predict(fit_lasso, x[idx,], s=cvFit_lasso$lambda.min, type="link")
    fit_ridge = glmnet(x[-idx,], y_train, family="cox", alpha=0)
    cvFit_ridge = cv.glmnet(x[-idx,], y_train, family="cox", alpha=0)
    pred_ridge = predict(fit_ridge, x[idx,], s=cvFit_ridge$lambda.min, type="link")
    fit_net = glmnet(x[-idx,], y_train, family="cox", alpha=0.5)
    cvFit_net = cv.glmnet(x[-idx,], y_train, family="cox", alpha=0.5)
    pred_net = predict(fit_net, x[idx,], s=cvFit_net$lambda.min, type="link")
    # Collect held-out row indices and their predictions, fold by fold
    index = c(index, idx)
    prediction_lasso = c(prediction_lasso, pred_lasso)
    prediction_ridge = c(prediction_ridge, pred_ridge)
    prediction_net = c(prediction_net, pred_net)
  }
  # Put the predictions back into the original row order
  Match = match(seq(nrow(data)), index)
  prediction_lasso = prediction_lasso[Match]
  prediction_ridge = prediction_ridge[Match]
  prediction_net = prediction_net[Match]
  c_lasso = survConcordance(y_dat~prediction_lasso)$concordance
  c_ridge = survConcordance(y_dat~prediction_ridge)$concordance
  c_net = survConcordance(y_dat~prediction_net)$concordance
  final_pred = cbind(prediction_lasso, prediction_ridge, prediction_net)
  return(list(pred = final_pred, c_index=c(c_lasso, c_ridge, c_net)))
}
The part I need to modify is this:
folds = sample(rep(1:k, length.out = nrow(data)))
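folds becomes a vector of 1459 numbers between 1 and 5, so I can "fold" my 1459 observations accordingly (into 5 groups, k=5). However, my data contains an "ID" variable. Most of the time it is a unique number, but sometimes there are doubles/triples. It is very important that rows with the same ID number get the same fold number, so that the same ID never appears in two different folds. I have 1459 observations and 1240 distinct IDs. If I want 5 folds (k), each fold should contain (1240/5 =) 248 distinct ID numbers.
Does anyone know a cool/simple function to manage this? After a lot of playing around in R, I am starting to fear that I will have to create that vector manually for 1459 obs...
Thanks in advance!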
You can assign folds to the unique IDs first and then match them back to the rows. Example:
k <- 5
set.seed(1)
samplepool <- paste0("ID_", sprintf("%04d", 1:1240))
df <- data.frame(idx=1:1459,
                 ID=sort(c(sample(samplepool, (1459-1240), replace = TRUE), samplepool)))
folds <- sample(rep(1:k, length.out = length(unique(df$ID))))
folds <- folds[match(df$ID, unique(df$ID))]
Created on 2020-05-05 by the reprex package (v0.3.0)
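A quick sanity check on the example above, as a minimal sketch: it rebuilds the same toy `df` and `folds`, then confirms that no ID is split across folds and counts the distinct IDs in each fold. The names `folds_per_id` and `ids_per_fold` are introduced here for illustration only.

```r
k <- 5
set.seed(1)
samplepool <- paste0("ID_", sprintf("%04d", 1:1240))
df <- data.frame(idx = 1:1459,
                 ID = sort(c(sample(samplepool, 1459 - 1240, replace = TRUE), samplepool)))
folds <- sample(rep(1:k, length.out = length(unique(df$ID))))
folds <- folds[match(df$ID, unique(df$ID))]

# Number of distinct folds each ID lands in; every entry should be 1
folds_per_id <- tapply(folds, df$ID, function(f) length(unique(f)))
stopifnot(all(folds_per_id == 1))

# Distinct IDs per fold: keep one row per ID, then tabulate
ids_per_fold <- table(folds[!duplicated(df$ID)])
print(ids_per_fold)

# Rows per fold, which need not be equal because repeated IDs
# are not spread evenly across folds
print(table(folds))
```

Because 1240 divides evenly by 5, each fold receives 248 distinct IDs. The number of rows per fold can still differ, depending on where the doubles and triples happen to land.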
So in your code, assuming the ID variable is named ID, you would replace
folds = sample(rep(1:k, length.out = nrow(data)))
with
folds = sample(rep(1:k, length.out = length(unique(data$ID))))
folds = folds[match(data$ID, unique(data$ID))]
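If it helps to reuse this, the two replacement lines can be wrapped in a small function. The sketch below is only an illustration: `make_id_folds` is a hypothetical helper name (it is not part of any package), and `toy` is made-up data used to show the behaviour.

```r
# Hypothetical helper: returns one fold label per row, with all rows
# that share an ID placed in the same fold.
make_id_folds <- function(ids, k) {
  uid <- unique(ids)
  # Shuffle a balanced set of fold labels, one per distinct ID
  fold_of_id <- sample(rep(seq_len(k), length.out = length(uid)))
  # Look up each row's ID to find its fold
  fold_of_id[match(ids, uid)]
}

# Made-up data: 7 distinct IDs, some repeated
set.seed(1)
toy <- data.frame(ID = c(101, 101, 102, 103, 103, 103, 104, 105, 106, 107))
toy$fold <- make_id_folds(toy$ID, k = 3)
print(toy)

# Each ID should map to exactly one fold
stopifnot(all(tapply(toy$fold, toy$ID, function(f) length(unique(f))) == 1))
```

Inside `c_indexCv_combined1` this would read `folds = make_id_folds(data$ID, k)`. One thing to check in your own data: `model.matrix(~., data[,-c(1,2)])` uses every remaining column, so if `ID` is among them it would be treated as a predictor.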