How to make H2OGridSearch for H2OGradientBoostingEstimator reproducible (repeatable) in a Spark environment?
I am running a GBM in Sparkling Water with the code below. Even though I have set a seed and score_each_iteration=True, the AUC I get is different every time I rerun it.
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# initialize the estimator
gbm_cov = H2OGradientBoostingEstimator(sample_rate=0.7, col_sample_rate=0.7, ntrees=1000,
                                       balance_classes=True, score_each_iteration=True,
                                       nfolds=5, seed=1234)

# set up the hyperparameter search space
gbm_hyper_params = {'learn_rate': [0.01, 0.015, 0.025, 0.05, 0.1],
                    'max_depth': [3, 5, 7, 9, 12],
                    # 'sample_rate': [i * 0.1 for i in range(6, 11)],
                    # 'col_sample_rate': [i * 0.1 for i in range(6, 11)],
                    # 'ntrees': [i * 100 for i in range(1, 11)]
                    }

# define the search criteria
gbm_search_criteria = {'strategy': "RandomDiscrete",
                       'max_models': 10,
                       'max_runtime_secs': 1800,
                       'stopping_metric': eval_metric,
                       'stopping_tolerance': 0.001,
                       'stopping_rounds': 3,
                       'seed': 1}

# build the grid search (we could use "Cartesian" if the search space were small)
gbm_grid = H2OGridSearch(model=gbm_cov,
                         hyper_params=gbm_hyper_params,
                         search_criteria=gbm_search_criteria)

# train using the grid
gbm_grid.train(x=top_feature, y=y, training_frame=htrain)
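
To show what I mean by "different results", here is a minimal sketch of how I compare two runs; top_feature, y and htrain come from my own data preparation, and checking the cross-validated AUC of the best model is just how I happen to inspect the grid:

best_aucs = []
for run in range(2):
    grid = H2OGridSearch(model=gbm_cov,
                         hyper_params=gbm_hyper_params,
                         search_criteria=gbm_search_criteria)
    grid.train(x=top_feature, y=y, training_frame=htrain)
    sorted_grid = grid.get_grid(sort_by='auc', decreasing=True)  # sort models by cross-validated AUC
    best_aucs.append(sorted_grid.models[0].auc(xval=True))       # record the best model's AUC

print(best_aucs)  # the two values differ even though every seed is fixed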
Commenting out
'max_runtime_secs': 1800
solves the reproducibility problem. Another thing I found, although I do not know why, is that the code runs faster if the early-stopping settings are moved out of the search criteria and into the H2OGradientBoostingEstimator itself (a sketch of that variant follows the parameter list below):
'stopping_metric': eval_metric,
'stopping_tolerance': 0.001,
'stopping_rounds': 3,
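
For reference, this is roughly the variant that turned out to be reproducible (and faster) for me; it reuses the names defined above, and the _es/_repro suffixes are only for illustration:

# early stopping moved onto the estimator, max_runtime_secs removed from the search criteria
gbm_cov_es = H2OGradientBoostingEstimator(sample_rate=0.7, col_sample_rate=0.7, ntrees=1000,
                                          balance_classes=True, score_each_iteration=True,
                                          nfolds=5, seed=1234,
                                          stopping_metric=eval_metric,
                                          stopping_tolerance=0.001,
                                          stopping_rounds=3)

gbm_search_criteria_repro = {'strategy': "RandomDiscrete",
                             'max_models': 10,   # note: no 'max_runtime_secs' here
                             'seed': 1}

gbm_grid_repro = H2OGridSearch(model=gbm_cov_es,
                               hyper_params=gbm_hyper_params,
                               search_criteria=gbm_search_criteria_repro)
gbm_grid_repro.train(x=top_feature, y=y, training_frame=htrain)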