Caret - Setting the seeds inside the gafsControl()

I am trying to set the seeds inside caret's gafsControl(), but I get this error:

Error in { : task 1 failed - "supplied seed is not a valid integer"

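I know that seeds for trainControl() is a list whose length equals the number of resamples plus one, and that each resampling entry holds one seed per combination of the model's tuning parameters (36 in my case: an SVM with 6 sigma values and 6 cost values). However, I could not figure out what I should use for gafsControl(). I have tried iters*popSize (100*10), iters (100), and popSize (10), but none of them worked.

Thanks in advance.

Here is my code (with simulated data):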

library(caret)
library(doMC)
library(kernlab)

registerDoMC(cores=32)

set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)

mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)

## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)

#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)

#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)

## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)


gaCtrl <- gafsControl(functions = mylogGA,
                      method = "cv",
                      number = 3,
                      metric = c(internal = "logLoss",
                                 external = "logLoss"),
                      verbose = TRUE,
                      maximize = c(internal = FALSE,
                                   external = FALSE),
                      index = ga_index,
                      seeds = ga_seeds,
                      allowParallel = TRUE)

tCtrl = trainControl(method = "cv", 
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = mnLogLoss,
                     index = tr_index,
                     seeds = tr_seeds,
                     allowParallel = FALSE)


svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))

t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
                      y = train.set$Class,
                      gafsControl = gaCtrl,
                      trControl = tCtrl,
                      popSize = 10,
                      iters = 100,
                      method = "svmRadial",
                      preProc = c("center", "scale"),
                      tuneGrid = svmGrid,
                      metric="logLoss",
                      maximize = FALSE)

t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)

save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")

Session info:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
 [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         magrittr_1.5        splines_3.2.2        MASS_7.3-43         munsell_0.4.2      
 [6] colorspace_1.2-6    foreach_1.4.2       minqa_1.2.4         car_2.0-26          stringr_1.0.0      
 [11] plyr_1.8.3          tools_3.2.2         parallel_3.2.2      pbkrtest_0.4-2      nnet_7.3-10        
 [16] grid_3.2.2          gtable_0.1.2        nlme_3.1-122        mgcv_1.8-7          quantreg_5.18      
 [21] MatrixModels_0.4-1  iterators_1.0.7     gtools_3.5.0        lme4_1.1-9          digest_0.6.8       
 [26] Matrix_1.2-2        nloptr_1.0.4        reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5      
 [31] compiler_3.2.2      BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.2        SparseM_1.7        
 [36] brglm_0.5-9         proto_0.3-10       
> 

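I found my mistake by inspecting gafs.default. The seeds argument of gafsControl() takes a vector of length (n_repeats*nresampling)+1, not a list (as trainControl$seeds does). In fact, the documentation in ?gafsControl says that seeds is "a vector or integers that can be used to set the seed during each search. The number of seeds must be equal to the number of resamples plus one." It took me a while to figure this out, so this is a reminder to read the documentation carefully :D. The relevant check in gafs.default is: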

    if (!is.null(gafsControl$seeds)) {
        if (length(gafsControl$seeds) < length(gafsControl$index) + 
            1) 
            stop(paste("There must be at least", length(gafsControl$index) + 
            1, "random number seeds passed to gafsControl"))
    }
    else {
        gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) + 
        1)
    }
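The error message in the question can be reproduced in base R without caret. This is only a minimal sketch: it assumes the seeds are consumed one at a time through set.seed(), and how gafs indexes them internally is an assumption here, not something shown in the excerpt above. The names list_seeds and vec_seeds are made up for illustration.

```r
# Hypothetical illustration in base R; no caret function is called here.
set.seed(1056)
list_seeds <- list(sample.int(1500, 1000), sample.int(1500, 1000))
vec_seeds  <- sample.int(1500, 2)

# Single-bracket subsetting of a list returns a list, which set.seed() rejects.
list_result <- tryCatch({ set.seed(list_seeds[1]); "ok" },
                        error = function(e) conditionMessage(e))

# Single-bracket subsetting of an integer vector returns a single integer.
vec_result <- tryCatch({ set.seed(vec_seeds[1]); "ok" },
                       error = function(e) conditionMessage(e))

print(list_result)
print(vec_result)
```

If the seeds are indeed used this way, supplying a list would explain why every length I tried in the question failed in the same manner.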

So the correct way to set my ga_seeds is:

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
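To avoid hard-coding the 4, the number of seeds can be derived from the index itself. The sketch below uses base R only; check_gafs_seeds is a hypothetical helper that mirrors the length check quoted from gafs.default (it is not part of caret), and mock_index is a stand-in for the list returned by createFolds().

```r
# Hypothetical helper: mirrors the "length(index) + 1" rule from gafs.default
# and additionally requires an atomic numeric vector rather than a list.
check_gafs_seeds <- function(seeds, index) {
  needed <- length(index) + 1
  is.atomic(seeds) && is.numeric(seeds) && length(seeds) >= needed
}

# Stand-in for the 3-fold index list produced by createFolds().
mock_index <- list(Fold1 = 1:100, Fold2 = 101:200, Fold3 = 201:300)

set.seed(1056)
ga_seeds <- sample.int(1500, length(mock_index) + 1)

ok_vector <- check_gafs_seeds(ga_seeds, mock_index)           # integer vector
ok_list   <- check_gafs_seeds(vector("list", 4), mock_index)  # list of the same length

print(ok_vector)
print(ok_list)
```

With the real ga_index in place of mock_index, the same expression length(ga_index) + 1 keeps the seed count in step with the number of folds.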

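I am not very familiar with the gafsControl() function you mention, but I ran into a very similar problem when setting seeds for parallel runs with trainControl(). The documentation describes creating a list (length = number of resamples + 1) in which each item is itself a list (length = number of parameter combinations to test). I found that this does not work (see topepo/caret issue #248 for details). However, if you then convert each item to a vector, e.g.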

seeds <- lapply(seeds, as.vector)

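then the seeds seem to work (i.e. the models and predictions are fully reproducible). I should clarify that this was with doMC as the backend. Other parallel backends may behave differently.

Hope that helps.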
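For reference, here is a minimal base R sketch of that layout, assuming the entries start out as plain lists of single integers. The sizes (5 resamples, 36 combinations) are taken from the question; seeds_nested is a made-up name. Note that in base R as.vector() returns a plain list unchanged, so this sketch uses unlist() to obtain integer vectors; whether the entries in the linked issue were plain lists is not something this example can confirm.

```r
n_resamples <- 5    # folds in tr_index
n_combos    <- 36   # rows of svmGrid (6 sigma values x 6 C values)

set.seed(1056)
# Entries built as lists of single integers (the list-of-lists layout).
seeds_nested <- lapply(seq_len(n_resamples),
                       function(i) as.list(sample.int(1000, n_combos)))
# One extra entry with a single seed for the final model.
seeds_nested[[n_resamples + 1]] <- as.list(sample.int(1000, 1))

# Flatten every entry to an integer vector.
tr_seeds <- lapply(seeds_nested, unlist)

# Every entry should now be an integer vector, not a list.
stopifnot(all(vapply(tr_seeds, is.integer, logical(1))))
print(lengths(tr_seeds))
```

Building each entry directly with sample.int(), as the question's own tr_seeds code does, produces the same shape without the conversion step.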

If the seeds are set this way, can you be sure that every run selects the same feature subset? I am asking because of the randomness of the GA.