How to prevent RStudio from crashing?

Question

I am currently working on a machine learning project for my exam. My computer has 32 GB of RAM and a 12-core i7. My session info is as follows:

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils    
[6] datasets  methods   base     

other attached packages:
 [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [4] purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
 [7] tibble_3.0.4    tidyverse_1.3.0 here_1.0.1     
[10] caret_6.0-86    ggplot2_3.3.3   lattice_0.20-41

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           lubridate_1.7.9.2   
 [3] class_7.3-17         assertthat_0.2.1    
 [5] rprojroot_2.0.2      ipred_0.9-9         
 [7] foreach_1.5.1        R6_2.5.0            
 [9] cellranger_1.1.0     plyr_1.8.6          
[11] backports_1.2.1      reprex_0.3.0        
[13] stats4_4.0.3         httr_1.4.2          
[15] pillar_1.4.7         rlang_0.4.10        
[17] readxl_1.3.1         rstudioapi_0.13     
[19] data.table_1.13.6    rpart_4.1-15        
[21] Matrix_1.3-2         splines_4.0.3       
[23] gower_0.2.2          munsell_0.5.0       
[25] broom_0.7.3          compiler_4.0.3      
[27] modelr_0.1.8         pkgconfig_2.0.3     
[29] nnet_7.3-14          tidyselect_1.1.0    
[31] prodlim_2019.11.13   codetools_0.2-18    
[33] fansi_0.4.1          crayon_1.3.4        
[35] dbplyr_2.0.0         withr_2.3.0         
[37] MASS_7.3-53          recipes_0.1.15      
[39] ModelMetrics_1.2.2.2 grid_4.0.3          
[41] nlme_3.1-151         jsonlite_1.7.2      
[43] gtable_0.3.0         lifecycle_0.2.0     
[45] DBI_1.1.0            magrittr_2.0.1      
[47] pROC_1.16.2          scales_1.1.1        
[49] cli_2.2.0            stringi_1.5.3       
[51] reshape2_1.4.4       fs_1.5.0            
[53] timeDate_3043.102    xml2_1.3.2          
[55] ellipsis_0.3.1       generics_0.1.0      
[57] vctrs_0.3.6          lava_1.6.8.1        
[59] iterators_1.0.13     tools_4.0.3         
[61] glue_1.4.2           hms_0.5.3           
[63] survival_3.2-7       colorspace_2.0-0    
[65] rvest_0.3.6          haven_2.3.1

My data is 50,000 x 30. Initially I trained my models, for both classification and regression problems, with the following code:

library(caret)
library(doParallel)  # also attaches parallel and foreach

# Algorithms (character vector of caret method names), df and myFolds
# are defined earlier in the script.
models <- list()

# Generate cluster
genCluster <- makeCluster(
  spec = detectCores() - 1
)

registerDoParallel(
  cl = genCluster
)

set.seed(1903)
system.time(
  for (i in seq_along(Algorithms)) {

    # train models
    suppressWarnings(
      models[[i]] <- train(
        form = Y ~ .,
        data = df,
        method = Algorithms[i],
        trControl = trainControl(
          method = "repeatedcv",
          number = 10,
          repeats = 3,
          index = myFolds,
          verboseIter = FALSE,
          allowParallel = TRUE
        )
      )
    )

  }
)

stopCluster(
  cl = genCluster
)

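Before running the whole script, I draw a random sample from my data to check that the script works. In these test runs I usually use 2,000 observations, and that generally works well.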

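However, whenever I use the entire dataset I either get an unserialize error or some related "dead worker" error. If that does not happen, my R session crashes. Note: this also happens when I run the same code on my university's supercomputer, on a 64-core instance with 320 GB of RAM.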

How I have tried to solve the problem

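Instead of using the maximum number of cores, I used a number of cores equal to the number of folds k, so 10. This helped (somewhat) with the worker/core-related errors, which seem fairly random in my case. The R session crashes persisted, however.
I decided to execute my scripts from the terminal instead of RStudio. However, every relative path in my scripts is relative to the project root directory, and having to change this in more than 30 scripts seems disproportionate, given that RStudio should simply work. For some odd reason, setwd() called from the R terminal does not affect the sub-scripts!
I tried cleaning the environment and memory before executing each heavy script: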

rm(
  list=setdiff(
    ls(), 
    c("importantParameters",
      "train.data",
      "estimateFoo",
      "bestPick")
  )
)


gc(full = T, verbose = F)

This did not change anything regarding the crashes or the worker/core-related errors.
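
One possible reason the cleanup above made no difference, offered as a sketch rather than a diagnosis: with a cluster created by makeCluster(), every worker is a separate R process with its own memory, so rm() and gc() in the main session do not touch whatever the workers hold. The object name big and the two-worker cluster below are illustrative assumptions, not part of the original script.

```r
library(parallel)

# Two workers, each a separate R process (illustrative size)
cl <- makeCluster(2)

# Roughly 8 MB in the main session; 'big' is a hypothetical stand-in
big <- numeric(1e6)

# Every worker now holds its own copy of 'big'
clusterExport(cl, "big", envir = environment())

# This frees the copy in the main session only
rm(big)
invisible(gc())

# The copies on the workers are still there
worker_has_copy <- unlist(clusterEvalQ(cl, exists("big")))
worker_mb <- unlist(clusterEvalQ(cl, as.numeric(object.size(big)) / 1024^2))

# Cleaning up the workers has to happen on the workers themselves
invisible(clusterEvalQ(cl, { rm(big); gc(); NULL }))

stopCluster(cl)
```

Stopping the cluster with stopCluster() ends the worker processes and releases their memory as well.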

My new approach

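After giving up, I adopted a new approach based on mclapply. It is considerably slower and does not work as I imagined. Note that I set allowParallel = F in this version, because I expected mclapply to run all the models in the list simultaneously. As far as I can tell from my system monitor, that is not the case.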

estimateFoo <- function(algorithms, equation, cores, plot = F, data, trainObject, type = NULL, plot.name = NULL, metric = c("RMSE")){
  
  # Packages
  require(parallel)
  require(caret)
  require(tidyverse)
  
  # This function estimates all algorithms. Must be provided by a vector of characters.
  # FULL TrainObjects from Caret has to be provided.
  # If plot == T it plots in a tryCatch fashion, to avoid Errors.
  # NOTE: Type has to be oneof classification or regression (As the folders are named.)
  
  trainedModels <- suppressWarnings(mclapply(
      X = algorithms,
      FUN = function(x){
        
        # Return NULL on failure so the model is dropped below
        tryCatch(
          train(
            form   = equation,
            data   = data,
            method = x,
            trControl = trainObject
          ),
          error = function(e) NULL
        )
        
      },
      mc.cores = cores
    )
  )
  
  
  
  # Identify failed models (try-errors) and dead workers (NULL
  # elements) and remove them. Otherwise the script breaks down.
  # inherits() is used because a train object has more than one class.
  failed <- vapply(
    trainedModels,
    FUN = function(m) is.null(m) || inherits(m, "try-error"),
    FUN.VALUE = logical(1)
  )

  # Remove failed models
  trainedModels <- trainedModels[!failed]

  # Name List Elements
  names(trainedModels) <- algorithms[!failed]

  # If plot is true; then it plots all models and saves
  if (isTRUE(plot)){

    # Generate resamples from the models that survived
    modelResample <- resamples(trainedModels)

    print(
      dotplot(
        modelResample,
        metric = metric,
        scales = list(x = list(relation = "free"),
                      y = list(cex = 1.2))
      )
    )

    dev.copy(pdf, here("results","models", paste(type), paste(plot.name)))
    dev.off()

  }

  # A negative index such as trainedModels[-deadWorker] returns an
  # empty list when there are no dead workers, so the filtered list
  # is returned directly.
  return(trainedModels)
}

This new approach works, although it is slower. However, my R session still crashes!
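
A possible explanation for what the system monitor showed, stated as a guess: mclapply never runs more than mc.cores jobs at once, and with the default mc.preschedule = TRUE it splits the inputs into at most mc.cores chunks up front, so one slow algorithm can leave a single core busy while the rest sit idle. The sketch below uses Sys.sleep() as a stand-in for model fitting and sets mc.preschedule = FALSE so that each element gets its own forked process. The task names and durations are made up. Forking is not available on Windows, and the parallel documentation discourages it inside GUI front ends such as RStudio, so this is best run with Rscript.

```r
library(parallel)

# Hypothetical tasks: names stand in for algorithms, values are seconds of work
tasks <- c(a = 0.2, b = 0.2, c = 0.2, d = 0.2)

# One forked child per element, at most two running at the same time
pids <- mclapply(
  X = tasks,
  FUN = function(seconds) {
    Sys.sleep(seconds)  # stand-in for train()
    Sys.getpid()        # report which process did the work
  },
  mc.cores = 2,
  mc.preschedule = FALSE
)

# None of the work ran in the main session
main_pid <- Sys.getpid()
ran_in_children <- !(main_pid %in% unlist(pids))
```

Each forked child starts from a snapshot of the main session, so large objects present there at fork time count toward the memory pressure as soon as the children modify them.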

What should I do? How do I do machine learning in R properly without losing my mind and wasting four days trying to get R to run all my code without crashing?

Answer 1

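I am answering my own question based on the comments I received. If anyone has something to contribute, or finds this post irrelevant, please flag it for deletion.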

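R sessions crash mainly because they run out of memory. So if you are training models with a grid search, you need a rough estimate of how much RAM it will take in order for it to run smoothly. Whether RAM usage can be limited by changing some of the function's arguments, for example by setting returnData = F, I have not tested due to time constraints.
Training your models with allowParallel = T means every worker needs its own share of RAM, so RAM usage grows roughly linearly with the number of workers, and RAM runs out quickly when models are trained simultaneously.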
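
A rough estimate of this kind can start from the size of the data, as in the minimal sketch below. The data frame df_example is a hypothetical stand-in with the same shape as the data in the question, and 11 workers corresponds to detectCores() - 1 on a 12-core machine. The result is only a lower bound: fitted models, resampling results and intermediate matrices are not counted and will often dominate.

```r
# Hypothetical stand-in: 50,000 rows x 30 numeric columns
set.seed(1)
df_example <- as.data.frame(matrix(rnorm(50000 * 30), nrow = 50000))

# Size of one copy of the data, in MB
data_mb <- as.numeric(object.size(df_example)) / 1024^2

# One copy per worker plus one in the main session (assumed 11 workers)
n_workers <- 11
lower_bound_mb <- data_mb * (n_workers + 1)

cat(sprintf("one copy: %.1f MB, lower bound with %d workers: %.1f MB\n",
            data_mb, n_workers, lower_bound_mb))
```

If the lower bound is small compared with the available RAM, as it is for numeric data of this shape, the memory is more likely going into the models themselves than into copies of the data.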

So the solution, so far, has to be getting more RAM, reducing the size of the data, or limiting the grid search.
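
As a sketch of what limiting the grid search might look like in caret, the configuration below passes a small explicit tuning grid and asks trainControl() to keep less in the returned object. Whether these settings lower peak memory during training was not tested in this thread. They mainly affect the size of the object that train() returns. The mtry values are placeholders for method = "rf" and are not taken from the question.

```r
library(caret)

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  returnData = FALSE,       # do not store a copy of the training data in the result
  returnResamp = "final",   # keep resampling results for the final tuning values only
  savePredictions = FALSE,  # do not keep hold-out predictions
  trim = TRUE,              # strip the final model where the method supports it
  allowParallel = TRUE
)

# A small explicit grid limits how many candidate models are fitted
# (mtry is the tuning parameter of method = "rf"; values are placeholders)
small_grid <- expand.grid(mtry = c(3, 5, 8))

# Usage, assuming df and Y as in the question:
# fit <- train(Y ~ ., data = df, method = "rf",
#              trControl = ctrl, tuneGrid = small_grid)
```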

Do not use allowParallel = T without considering the amount of RAM you have. This was new to me. I hope this helps you as much as it helped me.
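
One way to take RAM into account is to cap the number of workers by a memory budget as well as by the core count. The helper pick_workers() below is hypothetical, and the per-worker figure has to come from your own measurement, for example by watching one sequential run in the system monitor.

```r
# Hypothetical helper: choose a worker count limited by both RAM and cores
pick_workers <- function(ram_gb, per_worker_gb,
                         cores = parallel::detectCores(),
                         reserve_gb = 4) {
  # detectCores() can return NA on some platforms
  if (is.na(cores)) cores <- 2L

  # Leave some RAM for the main session and the operating system
  by_ram <- floor((ram_gb - reserve_gb) / per_worker_gb)

  # Leave one core free, as in the question
  by_cpu <- cores - 1

  # Always use at least one worker
  max(1L, as.integer(min(by_ram, by_cpu)))
}

# 32 GB machine with 12 cores, assuming about 3 GB per worker
n <- pick_workers(ram_gb = 32, per_worker_gb = 3, cores = 12)
# cl <- parallel::makeCluster(n)
```

With these assumed numbers the RAM budget, not the core count, is the binding limit.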

