Poor Out of Sample Performance in h2o randomForest
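I am running a random forest model using h2o in R. The exercise is a binary classification problem, and I have roughly five times as many '0's as '1's (14,098 vs. 2,438 in the training frame shown below).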
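Because the dataset is a time series (about 50k observations), I am using an expanding-window validation scheme rather than the typical CV scheme used in most ML cases. At each step, I train the model on 60% of the data available up to that point in time, then split the remaining 40% evenly between a validation frame and a test frame.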
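As a rough illustration of the split described above, the following base R sketch computes row indices for one step of an expanding window. The helper name `expanding_split`, its `train_frac` argument, and the example cut-offs are assumptions made for illustration, not the code actually used. It assumes the rows are already sorted by time.

```r
# Hypothetical helper: row indices for one step of an expanding-window scheme.
# `n_available` is the number of time-ordered rows available at this step.
expanding_split <- function(n_available, train_frac = 0.6) {
  stopifnot(n_available >= 3, train_frac > 0, train_frac < 1)
  n_train <- floor(train_frac * n_available)
  # the remainder is split evenly; any odd leftover row goes to the test set
  n_valid <- floor((n_available - n_train) / 2)
  list(
    train = seq_len(n_train),
    valid = n_train + seq_len(n_valid),
    test  = seq.int(n_train + n_valid + 1, n_available)
  )
}

# Example: three steps whose cut-off points grow over time (assumed values).
cutoffs <- c(30000, 40000, 50000)
splits <- lapply(cutoffs, expanding_split)

# Rows per frame (train / valid / test) at each step.
sapply(splits, lengths)
```

Because every validation and test row comes strictly after the training rows, no future information leaks into the fit at any step.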
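Performance on both the validation and test data is very poor, which suggests I am overfitting the training set. I would appreciate it if anyone could spot any glaring mistakes in how I set up the model and the hyperparameter search. It may be that I do not have enough features (n = 27) to adequately capture the response, but I want to rule out an incorrect model specification first.
Here is my specification for the model and the hyperparameter grid search.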
# load h2o and start (or connect to) a local cluster
library(h2o)
h2o.init()
# create feature names
y <- "Response"
x <- setdiff(names(gwWF1_train[, -c(1:3)]), y)
# turn training set into h2o object
gwWF1_train.h2o <- as.h2o(gwWF1_train[, -c(1:3)])
# turn validation set into h2o object
gwWF1_valid.h2o <- as.h2o(gwWF1_valid[, -c(1:3)])
# hyperparameter grid
hyper_grid.h2o <- list(
ntrees = seq(50, 1000, by = 50),
mtries = seq(2, 10, by = 1),
max_depth = seq(2, 10, by = 1),
min_rows = seq(5, 15, by = 1),
nbins = seq(5, 40, by = 5),
sample_rate = c(.55, .632, .75)
)
# random grid search criteria
search_criteria <- list(
strategy = "RandomDiscrete",
stopping_metric = "auc",
stopping_tolerance = 0.005,
stopping_rounds = 20,
max_runtime_secs = 5*60
)
# build grid search
random_grid <- h2o.grid(
algorithm = "randomForest",
grid_id = "rf_grid",
x = x,
y = y,
seed = 29,
balance_classes = TRUE,
training_frame = gwWF1_train.h2o,
validation_frame = gwWF1_valid.h2o,
hyper_params = hyper_grid.h2o,
search_criteria = search_criteria
)
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
grid_id = "rf_grid",
sort_by = "auc",
decreasing = TRUE
)
print(grid_perf)
Here is the performance on the training and validation frames.
Model Details:
==============
H2OBinomialModel: drf
Model ID: rf_grid_model_14
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 100 100 224183 10 10 10.00000 99 272 173.86000
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.1154841
RMSE: 0.3398295
LogLoss: 0.3836627
Mean Per-Class Error: 0.3434005
AUC: 0.716039
AUCPR: 0.3877677
Gini: 0.432078
R^2: 0.08126146
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 12089 2009 0.142502 =2009/14098
1 1327 1111 0.544299 =1327/2438
Totals 13416 3120 0.201742 =3336/16536
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.182031 0.399784 175
2 max f2 0.133141 0.498254 264
3 max f0point5 0.229674 0.437443 119
4 max accuracy 0.263718 0.863147 94
5 max precision 0.705162 1.000000 0
6 max recall 0.019641 1.000000 396
7 max specificity 0.705162 1.000000 0
8 max absolute_mcc 0.209525 0.298843 139
9 max min_per_class_accuracy 0.153280 0.651354 225
10 max mean_per_class_accuracy 0.174269 0.661033 188
11 max tns 0.705162 14098.000000 0
12 max fns 0.705162 2437.000000 0
13 max fps 0.010717 14098.000000 399
14 max tps 0.019641 2438.000000 396
15 max tnr 0.705162 1.000000 0
16 max fnr 0.705162 0.999590 0
17 max fpr 0.010717 1.000000 399
18 max tpr 0.019641 1.000000 396
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on validation data. **
MSE: 0.1340771
RMSE: 0.3661654
LogLoss: 0.4422115
Mean Per-Class Error: 0.5
AUC: 0.5295048
AUCPR: 0.1770512
Gini: 0.05900952
R^2: -0.006636504
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 0 13220 1.000000 =13220/13220
1 0 2485 0.000000 =0/2485
Totals 0 15705 0.841770 =13220/15705
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.025167 0.273227 399
2 max f2 0.025167 0.484500 399
3 max f0point5 0.161209 0.203459 140
4 max accuracy 0.291947 0.841770 1
5 max precision 0.291947 0.500000 1
6 max recall 0.025167 1.000000 399
7 max specificity 0.302964 0.999924 0
8 max absolute_mcc 0.178262 0.056716 103
9 max min_per_class_accuracy 0.133292 0.525553 215
10 max mean_per_class_accuracy 0.140828 0.529093 195
11 max tns 0.302964 13219.000000 0
12 max fns 0.302964 2485.000000 0
13 max fps 0.030503 13220.000000 398
14 max tps 0.025167 2485.000000 399
15 max tnr 0.302964 0.999924 0
16 max fnr 0.302964 1.000000 0
17 max fpr 0.030503 1.000000 398
18 max tpr 0.025167 1.000000 399
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Here is the performance on the test data.
H2OBinomialMetrics: drf
MSE: 0.1293883
RMSE: 0.3597059
LogLoss: 0.4274263
Mean Per-Class Error: 0.4719334
AUC: 0.530868
AUCPR: 0.1588326
Gini: 0.06173602
R^2: -0.001929629
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4321 12888 0.748910 =12888/17209
1 603 2490 0.194956 =603/3093
Totals 4924 15378 0.664516 =13491/20302
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.128344 0.269612 283
2 max f2 0.084278 0.473644 380
3 max f0point5 0.140284 0.194700 248
4 max accuracy 0.313605 0.847601 1
5 max precision 0.313605 0.333333 1
6 max recall 0.050387 1.000000 399
7 max specificity 0.316369 0.999884 0
8 max absolute_mcc 0.128344 0.047063 283
9 max min_per_class_accuracy 0.152158 0.522793 210
10 max mean_per_class_accuracy 0.140284 0.531315 248
11 max tns 0.316369 17207.000000 0
12 max fns 0.316369 3093.000000 0
13 max fps 0.062918 17209.000000 398
14 max tps 0.050387 3093.000000 399
15 max tnr 0.316369 0.999884 0
16 max fnr 0.316369 1.000000 0
17 max fpr 0.062918 1.000000 398
18 max tpr 0.050387 1.000000 399
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
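Can you post the list of all the models that were built, along with their performance? The best one has a depth of 10, so it would be useful to know which other hyperparameters were explored.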
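One way to produce such a list is sketched below. It wraps the h2o calls `h2o.getGrid`, `h2o.getModel`, `h2o.auc`, and `h2o.performance` in a helper, and it assumes a running H2O cluster that still holds the grid. The helper name `grid_auc_table` and the test frame `gwWF1_test.h2o` in the usage line are hypothetical, since the question does not show how the test frame was created.

```r
# Hypothetical helper: one row per grid model with its validation and test AUC.
# Requires the h2o package and a running cluster holding the grid `grid_id`.
grid_auc_table <- function(grid_id, test_frame) {
  grid <- h2o::h2o.getGrid(grid_id = grid_id, sort_by = "auc", decreasing = TRUE)
  ids <- unlist(grid@model_ids)
  do.call(rbind, lapply(ids, function(id) {
    model <- h2o::h2o.getModel(id)
    data.frame(
      model_id  = id,
      # AUC on the validation frame supplied to h2o.grid()
      valid_auc = h2o::h2o.auc(model, valid = TRUE),
      # AUC on a held-out frame that played no role in model selection
      test_auc  = h2o::h2o.auc(h2o::h2o.performance(model, newdata = test_frame)),
      stringsAsFactors = FALSE
    )
  }))
}

# Usage (assumes a test frame named gwWF1_test.h2o exists on the cluster):
# grid_auc_table("rf_grid", gwWF1_test.h2o)
```

The hyperparameter values tried for each model are available separately in the grid's `summary_table` slot, which is what `print(grid_perf)` in the question displays.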
Your maximum runtime is only 5 minutes, so most of the hyperparameter space has probably not been explored. Let's see which other models were built.
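To put the 5-minute budget in perspective, the size of the full grid can be computed directly from the hyperparameter list in the question. This is a base R sketch. The variable names are illustrative, and `n_models_built` is an assumed placeholder to be replaced with the actual number of models in the grid.

```r
# Same hyperparameter grid as in the question.
hyper_grid <- list(
  ntrees      = seq(50, 1000, by = 50),
  mtries      = seq(2, 10, by = 1),
  max_depth   = seq(2, 10, by = 1),
  min_rows    = seq(5, 15, by = 1),
  nbins       = seq(5, 40, by = 5),
  sample_rate = c(.55, .632, .75)
)

n_values <- lengths(hyper_grid)  # candidate values per hyperparameter
n_combos <- prod(n_values)       # size of the full Cartesian grid
n_combos
# 427680

# Assumed placeholder: number of models finished within max_runtime_secs.
n_models_built <- 15
n_models_built / n_combos        # fraction of the grid actually sampled
```

With over 400,000 combinations, a random search that finishes only a handful of models samples a very small fraction of the space, so raising `max_runtime_secs` or narrowing the grid may be worth trying.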