Avoiding overfitting with H2OGradientBoostingEstimator

It seems that the difference between the cross-validation and training AUC ROC for H2OGradientBoostingEstimator remains high despite my best attempts using min_split_improvement.

With the same data, scikit-learn's GradientBoostingClassifier(min_samples_split=10) shows no overfitting, but the H2O estimator still does.

Preparing the data

import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=40,
                           n_clusters_per_class=10,
                           n_informative=25,
                           random_state=12, shuffle=False)

features = ["x%02d" % (i) for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=features)
df["y"] = y
nfolds = 5

import h2o
h2o.init()

h2of = h2o.H2OFrame(df)
h2of["y"] = h2of["y"].asfactor()

Running the model

def print_h2o_auc(m):
    # Per-fold cross-validation summary, indexed by metric name.
    cv = m.cross_validation_metrics_summary().as_data_frame().set_index("")
    print("{m} train: {a:.2%} xv: {x:.2%}".format(
        m=m.model_id, a=m.auc(), x=float(cv.loc["auc", "mean"])))

from h2o.estimators.gbm import H2OGradientBoostingEstimator

for msi in [0.00001, 0.0001, 0.001, 0.01, 0.1]:
    m = H2OGradientBoostingEstimator(
        model_id="gbm %g" % (msi),
        ntrees=100, max_depth=3, min_rows=100, min_split_improvement=msi,
        nfolds=nfolds, fold_assignment="stratified",
        keep_cross_validation_predictions=True, seed=1)
    m.train(x=features, y="y", training_frame=h2of)
    print_h2o_auc(m)

This prints

gbm 1e-05 train: 84.35% xv: 77.12%
gbm 0.0001 train: 84.35% xv: 77.12%
gbm 0.001 train: 82.71% xv: 76.53%
gbm 0.01 train: 68.06% xv: 65.49%
gbm 0.1 train: 50.00% xv: 50.00%

In other words, the gap between training and cross-validation performance remains large (even though it does shrink).
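For reference, here is a minimal sketch of how the same training versus cross-validation comparison could be computed on the scikit-learn side. The helper name train_vs_cv_auc, the shuffled StratifiedKFold, and the seed values are assumptions made for illustration; this is not the exact code behind the claim above.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score


def train_vs_cv_auc(model, X, y, nfolds=5, seed=1):
    """Return (training AUC, mean cross-validated AUC) for a binary classifier.

    Hypothetical helper: it mirrors the H2O report by scoring the model
    fitted on all rows against the mean AUC over stratified folds.
    """
    # Shuffle inside the splitter because the data was generated with shuffle=False.
    cv = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=seed)
    # cross_val_score clones the model for each fold, so `model` itself stays unfitted here.
    cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    # Fit on the full data and score on the same rows to get the training AUC.
    model.fit(X, y)
    train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    return train_auc, cv_auc


# The scikit-learn baseline mentioned in the question (random_state is an assumption).
baseline = GradientBoostingClassifier(min_samples_split=10, random_state=1)
```

Calling train_vs_cv_auc(baseline, X, y) with the X and y from the data preparation step returns the two numbers to compare against the H2O output.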

What else can I try to reduce the overfitting?

min_split_improvement is only one of the parameters you can use. If I were you, I would run a GridSearch over a small range of values for
  1. ntrees
  2. learn_rate
  3. sample_rate
  4. col_sample_rate_per_tree

All of these matter for avoiding overfitting.
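A grid search over these four parameters might look roughly like the sketch below. The value ranges and the names hyper_params, grid_size and run_gbm_grid are placeholders I made up for illustration, so adjust them to your data; the fixed settings (max_depth, min_rows, folds, seed) are copied from the question.

```python
# Hypothetical small ranges for the four parameters listed above.
hyper_params = {
    "ntrees": [50, 100],
    "learn_rate": [0.05, 0.1],
    "sample_rate": [0.6, 0.8],
    "col_sample_rate_per_tree": [0.6, 0.8],
}


def grid_size(params):
    """Number of models a full Cartesian grid over `params` would train."""
    n = 1
    for values in params.values():
        n *= len(values)
    return n


def run_gbm_grid(train_frame, feature_names, response="y", nfolds=5):
    """Train one GBM per parameter combination and return the grid sorted by AUC."""
    # Imported here so the parameter grid can be inspected without an H2O install.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        model=H2OGradientBoostingEstimator(
            max_depth=3, min_rows=100,
            nfolds=nfolds, fold_assignment="stratified", seed=1),
        hyper_params=hyper_params,
    )
    grid.train(x=feature_names, y=response, training_frame=train_frame)
    # Best model first.
    return grid.get_grid(sort_by="auc", decreasing=True)
```

With the frame from the question this would be called as run_gbm_grid(h2of, features); the first entry of the returned grid's models list can then be passed to print_h2o_auc to check whether the train/xv gap has narrowed. The grid above trains 16 models, each with 5-fold cross-validation, so keep the ranges small.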