How to calculate the optimal max_depth to train an ML model with a huge number of features?

Question

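My data frame has N features per day, looking back 20 days (time series): I have ~400 features x 100k rows.

I am trying to identify the most important features, so I trained my XGBoost model this way: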

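import xgboost as xgb

# Recent XGBoost releases take eval_metric and early_stopping_rounds
# in the constructor rather than in fit()
model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=20, eval_metric="rmse", early_stopping_rounds=20)

eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)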

Then:

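import pandas as pd

def plot_fimportance(xgbmodel, df_x, top_n=30):
    features = df_x.columns.values
    # Map XGBoost's default names (f0, f1, ...) back to the column names
    mapFeat = dict(zip(["f" + str(i) for i in range(len(features))], features))
    # booster() was renamed to get_booster() in later XGBoost releases
    ts = pd.Series(xgbmodel.get_booster().get_fscore())
    # Names that are already column names (model fitted on a DataFrame) are kept as-is
    ts.index = [mapFeat.get(name, name) for name in ts.index]
    # Series.order() was removed from pandas; sort_values() replaces it
    ts.sort_values()[-top_n:].plot(kind="barh", figsize=(8, top_n - 10), title="feature importance")

# drop() removes rows by default, so name the column explicitly
plot_fimportance(model, df.drop(columns=['label']))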

I have heard that the max_depth parameter should be calculated like this:

max_depth = number of features / 3

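I think this may work for small datasets, but if I train my model with max_depth=133, my PC will probably explode, and I will probably overfit as well.

How can I calculate the optimal value of max_depth with such a huge number of features?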

Answer 1

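That equation does not give you the optimal depth; it is merely a heuristic. If you want the optimal depth, you have to find it empirically: find a working starting point and vary it in each direction. Apply a gradient-descent-style search, moving in whichever direction improves the validation score, to approach the best answer.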
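
A minimal sketch of that "start somewhere and vary in each direction" idea follows. Because max_depth is an integer, it is written as a simple hill-climb rather than literal gradient descent. The names search_max_depth and score_fn are hypothetical, as are X and y in the comments; score_fn stands for whatever validation error you choose to measure (lower is better), and the toy function at the bottom only exercises the search.

```python
from typing import Callable, Dict, Tuple


def search_max_depth(
    score_fn: Callable[[int], float],
    start: int = 6,
    min_depth: int = 1,
    max_depth: int = 20,
) -> Tuple[int, Dict[int, float]]:
    """Hill-climb over integer depths. score_fn returns an error: lower is better."""
    scores: Dict[int, float] = {}

    def score(depth: int) -> float:
        # Cache results so each depth is evaluated (trained) at most once.
        if depth not in scores:
            scores[depth] = score_fn(depth)
        return scores[depth]

    # Clamp the starting point into the allowed range.
    best = max(min_depth, min(start, max_depth))
    score(best)
    while True:
        # Look one step in each direction, staying inside the range.
        neighbours = [d for d in (best - 1, best + 1) if min_depth <= d <= max_depth]
        if not neighbours:
            return best, scores
        candidate = min(neighbours, key=score)
        if score(candidate) >= score(best):
            return best, scores  # no neighbour improves: local optimum
        best = candidate


# Assumption: with real data, score_fn could wrap cross-validation, e.g.
#
#   from sklearn.model_selection import TimeSeriesSplit, cross_val_score
#   import xgboost as xgb
#
#   def score_fn(depth):
#       model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=depth)
#       cv = TimeSeriesSplit(n_splits=5)  # keeps validation folds after training folds
#       return -cross_val_score(model, X, y, cv=cv,
#                               scoring="neg_root_mean_squared_error").mean()


def toy_score(depth: int) -> float:
    # Stand-in for a validation RMSE curve with its minimum at depth 4.
    return (depth - 4) ** 2 + 1.0


best_depth, tried = search_max_depth(toy_score, start=6)
print(best_depth, sorted(tried))
```

The search only finds a local optimum and trains one model per depth it visits, so with 100k rows and 400 features it is probably worth keeping the range small, which fits the common practice of using single-digit depths for boosted trees.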

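If all you want is the maximum limit that will run on your computer, you could tediously compute the storage requirements and find the largest value. As for balancing that against overfitting... you need to pick your trade-offs, and you are still stuck with experimentation.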
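
The storage calculation can be roughed out as below. This is a back-of-the-envelope sketch, not XGBoost's real memory accounting: it bounds only the size of the finished trees, using the fact that a binary tree of depth d has at most 2^d leaves and that a tree cannot have more leaves than training rows. The function names and the 64 bytes per node figure are assumptions made for illustration; memory used during training (data matrix, gradients, histograms) is not counted.

```python
def max_nodes_per_tree(max_depth: int, n_rows: int) -> int:
    """Upper bound on nodes in one binary tree of the given depth."""
    # At most 2**depth leaves, and never more leaves than training rows.
    leaves = min(2 ** max_depth, n_rows)
    # A binary tree in which every split has two children has 2*leaves - 1 nodes.
    return 2 * leaves - 1


def model_bytes_upper_bound(max_depth: int, n_trees: int, n_rows: int,
                            bytes_per_node: int = 64) -> int:
    """Worst-case model size; bytes_per_node is an assumed, hypothetical figure."""
    return n_trees * max_nodes_per_tree(max_depth, n_rows) * bytes_per_node


def largest_depth_within(budget_bytes: int, n_trees: int, n_rows: int,
                         bytes_per_node: int = 64, depth_limit: int = 64) -> int:
    """Largest depth whose worst-case model size fits the budget (0 if none)."""
    best = 0
    for depth in range(1, depth_limit + 1):
        size = model_bytes_upper_bound(depth, n_trees, n_rows, bytes_per_node)
        if size > budget_bytes:
            break  # the bound never decreases with depth, so stop here
        best = depth
    return best


# Numbers from the question: 1000 trees, 100k rows.
print(max_nodes_per_tree(6, 100_000))    # 127
print(max_nodes_per_tree(20, 100_000))   # 199999 (capped by the row count)
print(largest_depth_within(10**9, n_trees=1000, n_rows=100_000))  # 12
```

Under these assumptions the bound stops growing once 2^depth exceeds the number of rows, so for 100k rows anything past depth 17 costs no more model storage in the worst case than depth 17 does. That says nothing about overfitting, which still has to be checked by experiment.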

python

machine-learning

scikit-learn

xgboost