Error when training with R caret GBM: Error in { : task 1 failed - "arguments imply differing number of rows"

Question

I want to solve a classification problem with gbm. However, when I use caret, I get the following error.

Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"

For reference, my data contains no NA or empty values. Here is my data.

Using the gbm package directly works without problems. If you know why this happens when going through caret, please help.

My code and session info are below.

if(!require(caret)){install.packages('caret', dep=TRUE);require(caret)}
if(!require(data.table)){install.packages('data.table', dep=TRUE);require(data.table)}
if(!require(gbm)){install.packages('gbm', dep=TRUE);require(gbm)}

trainSet <- fread(file="trainSet.csv")

trainSet$result <- as.factor(trainSet$result)

fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5
) 

#Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"
model_gbm_caret<-train(result~ +size_delta+inserted_line+deleted_line+size, 
                       data = trainSet, 
                       method='gbm', 
                       trControl = fitControl,
                       verbose=TRUE)

#no error
model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)

Session info

(64-bit) Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gbm_2.1.5         data.table_1.12.8 caret_6.0-86      ggplot2_3.3.0     lattice_0.20-40

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4           pillar_1.4.3         compiler_3.5.3       gower_0.2.1          plyr_1.8.6
 [6] iterators_1.0.12     class_7.3-15         tools_3.5.3          rpart_4.1-15         packrat_0.5.0
[11] ipred_0.9-9          lubridate_1.7.4      lifecycle_0.2.0      tibble_2.1.3         nlme_3.1-137
[16] gtable_0.3.0         pkgconfig_2.0.3      rlang_0.4.5          Matrix_1.2-18        foreach_1.5.0
[21] rstudioapi_0.11      parallel_3.5.3       prodlim_2019.11.13   e1071_1.7-3          gridExtra_2.3
[26] stringr_1.4.0        withr_2.1.2          dplyr_0.8.5          pROC_1.16.2          generics_0.0.2
[31] recipes_0.1.10       stats4_3.5.3         nnet_7.3-13          grid_3.5.3           tidyselect_1.0.0
[36] glue_1.3.2           R6_2.4.1             survival_3.1-11      lava_1.6.7           reshape2_1.4.3
[41] purrr_0.3.3          magrittr_1.5         ModelMetrics_1.2.2.2 splines_3.5.3        scales_1.1.0
[46] codetools_0.2-16     MASS_7.3-51.5        rsconnect_0.8.16     assertthat_0.2.1     timeDate_3043.102
[51] colorspace_1.4-1     stringi_1.4.6        munsell_0.5.0        crayon_1.3.4

Thanks for your help!

Answer 1

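There are a few problems. If you look at what you are trying to predict, it does not really make sense as a classification target: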

library(gbm)
library(data.table)
library(caret)

trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")

table(trainSet$result)

  1   8   9  10  11  14  15  16  17  18  19  20  22  23  24  26  28  30  31  33 
  3   3   3   2  24   3   8   3   4   2  12   5  41   5   3  63   5   3   4   3 
 36  38  39  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59 
  3   3   2   5   6   2   2   3  28  14   4   3   5   3   3  10   8   2   6   6 
 60  61  62  65  67  70  72  73  74  75  76  77  79  80  81  82  83  85  87  88 
  5   9  10   3   5   4 813 257   6   3   9   9   2   3   3   6   2   5   3   6 
 90  92  93  94  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 
  3   2  20  13   5   3   3   9  42   2   2   3   7   2   2   4   2  13   2   3 
112 113 114 115 116 117 118 119 
  3  12   3   2   4   5   3   2

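You are trying to run classification on what look like discrete numeric values. If I run gbm directly, it runs but emits warnings, because the label has too many classes and too little data per class!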

trainSet$result = factor(trainSet$result)

model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)
Distribution not specified, assuming multinomial ...
Warning messages:
1: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
  NAs introduced by coercion
2: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
  NAs introduced by coercion
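To see how thin the classes are relative to the cross-validation setup, one can count the classes that have fewer rows than there are folds. The following is only a sketch on a made-up toy vector (not the real `trainSet$result`); `n_folds` is an illustrative name chosen to mirror `number = 5` from the question. It does not prove that rare classes cause the caret error, but it shows the kind of imbalance visible in the table above.

```r
# Toy outcome vector (made up for illustration, not the real trainSet$result)
result <- c(72, 72, 72, 72, 72, 73, 73, 73, 11, 11, 8, 9, 10)

# Number of observations per class
class_counts <- table(result)

# Hypothetical number of CV folds, mirroring `number = 5` in the question
n_folds <- 5

# Classes with fewer rows than folds cannot be present in every fold
rare_classes <- names(class_counts)[class_counts < n_folds]

cat("number of classes:", length(class_counts), "\n")
cat("classes with fewer than", n_folds, "rows:",
    paste(rare_classes, collapse = ", "), "\n")
```

On the real data the same check would use `table(trainSet$result)` in place of the toy vector.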

If it really is a classification problem, you can reduce it to 3 classes:

trainSet$label = as.character(trainSet$result)
trainSet$label[!trainSet$label %in% c(72,73)] <- "others"

fitControl <- trainControl(method = "cv",number=2) 
model_gbm_caret<-train(label~ +size_delta+inserted_line+deleted_line+size, 
                       data = trainSet, 
                       method='gbm', 
                       trControl = fitControl,
                       verbose=TRUE,distribution="multinomial")
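The snippet above hard-codes the two most frequent values, 72 and 73. A more general variant is to keep every class that reaches some minimum count and lump the rest together. This is a minimal base R sketch; `lump_rare`, `min_n`, and the toy vector are hypothetical names and data introduced here for illustration, not part of caret or gbm.

```r
# Hypothetical helper: keep levels with at least `min_n` rows,
# replace all other values with the label given in `other`
lump_rare <- function(x, min_n, other = "others") {
  x <- as.character(x)
  counts <- table(x)
  keep <- names(counts)[counts >= min_n]
  x[!x %in% keep] <- other
  factor(x)  # return a factor so modelling functions treat it as a class label
}

# Toy data (made up): 72 and 73 are frequent, the rest are singletons
result <- c(72, 72, 72, 73, 73, 73, 11, 8, 9)
label <- lump_rare(result, min_n = 3)
print(table(label))
```

The threshold is a judgment call; with k-fold cross-validation, a value at least as large as the number of folds is a reasonable starting point.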

Or you run a regression instead (which I hope is what was intended):

trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")
fitControl <- trainControl(method = "cv",number=2) 
model_gbm_caret<-train(result ~ +size_delta+inserted_line+deleted_line+size, 
                       data = trainSet, 
                       method='gbm', 
                       trControl = fitControl,
                       verbose=TRUE)
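The regression snippet re-reads the CSV so that `result` is numeric again; as far as I know, caret's `train` chooses regression or classification from the type of the outcome (numeric versus factor). If the column has already been converted to a factor and re-reading is inconvenient, the values can be recovered in place, but note the usual pitfall: calling `as.numeric` directly on a factor returns the internal level codes, not the original numbers. A small sketch on made-up values:

```r
# Toy factor (made up), standing in for a `result` column already converted with as.factor()
f <- factor(c(72, 73, 11, 72))

# Pitfall: this yields the integer codes of the levels, not the original values
wrong <- as.numeric(f)

# Going through character first recovers the original numbers
right <- as.numeric(as.character(f))

print(wrong)
print(right)
```

Applied to the question's data, this would correspond to `trainSet$result <- as.numeric(as.character(trainSet$result))` before calling `train`.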
