Poor Out-of-Sample Performance in h2o randomForest

I am running a random forest model in R using h2o. The exercise is a binary classification problem, and the classes are imbalanced: there are roughly five times as many '0's as '1's.
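A quick check of the class balance in the training data (assuming Response is the factor response column):

# inspect the class ratio of the response
table(gwWF1_train$Response)
prop.table(table(gwWF1_train$Response))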

Because the dataset is a time series (roughly 50k observations), I am using a growing-window validation scheme rather than the typical CV scheme used in most ML settings. At each step, I train the model on the first 60% of the data available up to that point in time and split the remaining 40% evenly between a validation frame and a test frame.
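As a sketch of a single step of that scheme (illustrative only; dat stands for the full time-ordered data frame, which is not shown here):

# one step of the growing window: first 60% train, next 20% validation, last 20% test
n         <- nrow(dat)                    # dat assumed to be sorted by time
train_idx <- 1:floor(0.6 * n)
valid_idx <- (floor(0.6 * n) + 1):floor(0.8 * n)
test_idx  <- (floor(0.8 * n) + 1):n

gwWF1_train <- dat[train_idx, ]
gwWF1_valid <- dat[valid_idx, ]
gwWF1_test  <- dat[test_idx, ]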

Performance on both the validation and test data is very poor, which suggests I am overfitting the training set. I would be grateful if anyone can spot any obvious, glaring errors in how I have set up the model and the hyperparameter search. It may simply be that my features (n = 27) don't adequately capture the response, but I want to rule out an incorrect model specification first.

Here is the specification of my model and hyperparameter grid search.

# load h2o and start a local cluster
library(h2o)
h2o.init()

# create feature names
y <- "Response"
x <- setdiff(names(gwWF1_train[, -c(1:3)]), y)

# turn training set into h2o object
gwWF1_train.h2o <- as.h2o(gwWF1_train[, -c(1:3)])

# turn validation set into h2o object
gwWF1_valid.h2o <- as.h2o(gwWF1_valid[, -c(1:3)])

# hyperparameter grid
hyper_grid.h2o <- list(
  ntrees      = seq(50, 1000, by = 50),
  mtries      = seq(2, 10, by = 1),
  max_depth   = seq(2, 10, by = 1),
  min_rows    = seq(5, 15, by = 1),
  nbins       = seq(5, 40, by = 5),
  sample_rate = c(.55, .632, .75)
)

# random grid search criteria
search_criteria <- list(
  strategy = "RandomDiscrete",
  stopping_metric = "auc",
  stopping_tolerance = 0.005,
  stopping_rounds = 20,
  max_runtime_secs = 5*60
  )

# build grid search 
random_grid <- h2o.grid(
  algorithm = "randomForest",
  grid_id = "rf_grid",
  x = x, 
  y = y, 
  seed = 29,
  balance_classes = TRUE,
  training_frame = gwWF1_train.h2o,
  validation_frame = gwWF1_valid.h2o,
  hyper_params = hyper_grid.h2o,
  search_criteria = search_criteria
  )
  
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
  grid_id = "rf_grid", 
  sort_by = "auc", 
  decreasing = TRUE
  )
print(grid_perf)
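For completeness, the test metrics further down were obtained by pulling the best model from the grid and scoring it on a held-out test frame, roughly like this (the test-set object name is assumed here):

# pull the best model from the grid and score it on the held-out test data
best_model <- h2o.getModel(grid_perf@model_ids[[1]])

gwWF1_test.h2o <- as.h2o(gwWF1_test[, -c(1:3)])   # assumed test set, same columns dropped
h2o.performance(best_model, newdata = gwWF1_test.h2o)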

Here is the performance on the training and validation frames:

Model Details:
==============

H2OBinomialModel: drf
Model ID:  rf_grid_model_14 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1             100                      100              224183        10        10   10.00000         99        272   173.86000


H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.1154841
RMSE:  0.3398295
LogLoss:  0.3836627
Mean Per-Class Error:  0.3434005
AUC:  0.716039
AUCPR:  0.3877677
Gini:  0.432078
R^2:  0.08126146

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0    1    Error         Rate
0      12089 2009 0.142502  =2009/14098
1       1327 1111 0.544299   =1327/2438
Totals 13416 3120 0.201742  =3336/16536

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold        value idx
1                       max f1  0.182031     0.399784 175
2                       max f2  0.133141     0.498254 264
3                 max f0point5  0.229674     0.437443 119
4                 max accuracy  0.263718     0.863147  94
5                max precision  0.705162     1.000000   0
6                   max recall  0.019641     1.000000 396
7              max specificity  0.705162     1.000000   0
8             max absolute_mcc  0.209525     0.298843 139
9   max min_per_class_accuracy  0.153280     0.651354 225
10 max mean_per_class_accuracy  0.174269     0.661033 188
11                     max tns  0.705162 14098.000000   0
12                     max fns  0.705162  2437.000000   0
13                     max fps  0.010717 14098.000000 399
14                     max tps  0.019641  2438.000000 396
15                     max tnr  0.705162     1.000000   0
16                     max fnr  0.705162     0.999590   0
17                     max fpr  0.010717     1.000000 399
18                     max tpr  0.019641     1.000000 396

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on validation data. **

MSE:  0.1340771
RMSE:  0.3661654
LogLoss:  0.4422115
Mean Per-Class Error:  0.5
AUC:  0.5295048
AUCPR:  0.1770512
Gini:  0.05900952
R^2:  -0.006636504

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       0     1    Error          Rate
0      0 13220 1.000000  =13220/13220
1      0  2485 0.000000       =0/2485
Totals 0 15705 0.841770  =13220/15705

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold        value idx
1                       max f1  0.025167     0.273227 399
2                       max f2  0.025167     0.484500 399
3                 max f0point5  0.161209     0.203459 140
4                 max accuracy  0.291947     0.841770   1
5                max precision  0.291947     0.500000   1
6                   max recall  0.025167     1.000000 399
7              max specificity  0.302964     0.999924   0
8             max absolute_mcc  0.178262     0.056716 103
9   max min_per_class_accuracy  0.133292     0.525553 215
10 max mean_per_class_accuracy  0.140828     0.529093 195
11                     max tns  0.302964 13219.000000   0
12                     max fns  0.302964  2485.000000   0
13                     max fps  0.030503 13220.000000 398
14                     max tps  0.025167  2485.000000 399
15                     max tnr  0.302964     0.999924   0
16                     max fnr  0.302964     1.000000   0
17                     max fpr  0.030503     1.000000 398
18                     max tpr  0.025167     1.000000 399

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

And here is the performance on the test data:

H2OBinomialMetrics: drf

MSE:  0.1293883
RMSE:  0.3597059
LogLoss:  0.4274263
Mean Per-Class Error:  0.4719334
AUC:  0.530868
AUCPR:  0.1588326
Gini:  0.06173602
R^2:  -0.001929629

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0     1    Error          Rate
0      4321 12888 0.748910  =12888/17209
1       603  2490 0.194956     =603/3093
Totals 4924 15378 0.664516  =13491/20302

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold        value idx
1                       max f1  0.128344     0.269612 283
2                       max f2  0.084278     0.473644 380
3                 max f0point5  0.140284     0.194700 248
4                 max accuracy  0.313605     0.847601   1
5                max precision  0.313605     0.333333   1
6                   max recall  0.050387     1.000000 399
7              max specificity  0.316369     0.999884   0
8             max absolute_mcc  0.128344     0.047063 283
9   max min_per_class_accuracy  0.152158     0.522793 210
10 max mean_per_class_accuracy  0.140284     0.531315 248
11                     max tns  0.316369 17207.000000   0
12                     max fns  0.316369  3093.000000   0
13                     max fps  0.062918 17209.000000 398
14                     max tps  0.050387  3093.000000 399
15                     max tnr  0.316369     0.999884   0
16                     max fnr  0.316369     1.000000   0
17                     max fpr  0.062918     1.000000 398
18                     max tpr  0.050387     1.000000 399

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Can you post a list of all the models that were built and their performance? The best one has a depth of 10, and it would be useful to know which other hyperparameters were explored.

Your maximum runtime is only 5 minutes, so most of the hyperparameter space has probably not been explored. Let's see what other models were built.
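One way to get that list from the grid object already created in the question (a sketch, not output from the actual run):

# summary of every model tried by the random search, sorted by validation AUC
print(grid_perf@summary_table)

# or pull the validation AUC for each model individually
sapply(grid_perf@model_ids, function(id) h2o.auc(h2o.getModel(id), valid = TRUE))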