R: how to specify my own CV folds in SuperLearner
library(SuperLearner)
library(MASS)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
SL.library = c("SL.mean", "SL.ranger"),
verbose = TRUE, cvControl = list(V = 5))
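In the code above, I am running 5-fold CV to train SuperLearner. But what if I want to create my own folds manually? I am interested in trying this because I know my data contain clusters, and I want to run CV on the folds that I create.
For example, below are five folds of my toy data: split1, ..., split5. Is there a way to use these 5 folds for cross-validation instead of letting SuperLearner split the data itself?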
set.seed(1)
index <- sample(1:5, size = nrow(X), replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
split1 <- X[index == 1, ]
split2 <- X[index == 2, ]
split3 <- X[index == 3, ]
split4 <- X[index == 4, ]
split5 <- X[index == 5, ]
split1.y <- Y[index == 1]
split2.y <- Y[index == 2]
split3.y <- Y[index == 3]
split4.y <- Y[index == 4]
split5.y <- Y[index == 5]
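The cross-validation procedure has a few control parameters, and you can use the validRows parameter for this. You will need a list with 5 elements, each a vector containing all the row indices that belong to one of your predefined clusters. Assuming you have added a column that records which cluster each observation belongs to, you could write: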
# df is assumed to hold the predictors plus a `cluster` column with values 1..5
L <- lapply(1:5, function(k) which(df$cluster == k))  # row indices of each cluster
X <- df[, setdiff(names(df), "cluster")]              # drop the cluster column from the predictors
sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
SL.library = c("SL.mean", "SL.ranger"),
verbose = TRUE, cvControl = list(V = 5, validRows=L))
Hope this helps!
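If there are more clusters than folds, one option is to assign whole clusters to folds before building the list. The following is only a minimal sketch in base R under that assumption; the `cluster` vector, the number of clusters (20), and the helper names `fold_of_cluster` and `fold_of_row` are made up for illustration and are not part of the original post.

```r
# Hypothetical toy data: 500 rows spread over 20 clusters (not from the question).
set.seed(1)
cluster <- sample(1:20, size = 500, replace = TRUE)

V <- 5
cluster_ids <- sort(unique(cluster))

# Randomly assign each cluster (not each row) to one of the V folds,
# so that all rows of a cluster end up in the same validation fold.
fold_of_cluster <- sample(rep(seq_len(V), length.out = length(cluster_ids)))
names(fold_of_cluster) <- cluster_ids
fold_of_row <- unname(fold_of_cluster[as.character(cluster)])

# One integer vector of row indices per fold: the list shape used for validRows.
validRows <- unname(split(seq_along(cluster), fold_of_row))
```

The resulting list can then be passed as `cvControl = list(V = 5, validRows = validRows)`. The package documentation also describes an `id` argument to `SuperLearner()` for cluster identifiers, which may be a simpler route; check it against your installed version.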
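Repeating the data preparation from the question gives a complete solution.
The last line verifies that the training data in each fold do not include the validation data.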
library(SuperLearner)
library(MASS)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
set.seed(1)
index <- sample(1:5, size = nrow(X), replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
validRows=list()
for (v in 1:5)
validRows[[v]] <- which(index==v)
sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
SL.library = c("SL.mean", "SL.ranger"),
verbose = TRUE,
control = SuperLearner.control(saveCVFitLibrary = TRUE),
cvControl = list(V = 5, shuffle = FALSE, validRows = validRows))
# training-set size per fold, deduced from the lengths of the declared validRows
n - sapply(sl_cv$validRows, length)
# training-set size per fold, deduced from the fitted ranger models
sapply(1:5, function(i) length(sl_cv$cvFitLibrary[[i]]$SL.ranger_All$object$predictions))
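Before fitting, it can also be worth checking that the hand-made folds form a proper partition of the rows, since overlapping or missing rows would undermine the cross-validation. This is a small base R sketch that rebuilds the same `index` as above; the names `all_rows`, `no_overlap`, and `full_cover` are illustrative only.

```r
set.seed(1)
n <- 500
index <- sample(1:5, size = n, replace = TRUE)
validRows <- lapply(1:5, function(v) which(index == v))

all_rows <- unlist(validRows)

# Each row should appear in exactly one validation fold.
no_overlap <- !any(duplicated(all_rows))
full_cover <- setequal(all_rows, seq_len(n))

# Expected training-set size for each fold: everything outside the fold.
fold_sizes <- lengths(validRows)
train_sizes <- n - fold_sizes
```

If both `no_overlap` and `full_cover` are TRUE, `train_sizes` should match the two vectors printed by the checks above.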