How to define 'min_samples_split' and 'min_samples_leaf' in a decision tree regressor?

Question

I am working with a dataset of 20060 rows and 10 columns, and I am using a decision tree regressor to make predictions.

I would like to use RandomizedSearchCV to tune the hyperparameters; my question is what values to write in the dictionary for 'min_samples_leaf' and 'min_samples_split'.

My professor told me to base them on the dimensions of the dataset, but I don't understand how!

Here is a code example:

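from time import time

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

def model(model, params):
    r_2 = []
    mae_ = []
    rs = RandomizedSearchCV(model, params, cv=5, n_jobs=-1, n_iter=30)
    start = time()
    rs.fit(X_train, y_train)
    # prediction on test data
    y_pred = rs.predict(X_test)
    # R2 (built-in round works whether the metric returns a Python or NumPy float)
    r2 = round(r2_score(y_test, y_pred), 2)
    print('R2 on test set: %.2f' % r2)
    r_2.append(r2)
    # MAE
    mae = round(mean_absolute_error(y_test, y_pred), 2)
    print('Mean absolute Error: %.2f' % mae)
    mae_.append(mae)
    # print running time
    print('RandomizedSearchCV took: %.2f' % (time() - start), 'seconds')
    return r_2, mae_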

params = {
    'min_samples_split': np.arange(),  # how to define these two hyperparameters based on the dataset size???
    'min_samples_leaf': np.arange(),
}

DT = model(DecisionTreeRegressor(), params)

Can someone explain this to me?

Thanks a lot

Answer 1

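What your professor means is to check the size of your data, so that you can decide on your parameter values.

For DecisionTreeRegressor, you can see that min_samples_split and min_samples_leaf depend on your n_samples, which is the number of rows. The documentation says the same thing for both parameters: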

min_samples_split: int or float, default=2

The minimum number of samples required to split an internal node:

· If int, then consider min_samples_split as the minimum number.

· If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

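As the documentation says, if you want parameters that refer to n_samples (as your teacher told you), you have to use floats, which represent a fraction of the number of samples (between 0.0 and 1.0).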

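For example, if you want to set min_samples_split to about 100, you can write it in two ways: simply as the integer 100, or as the float 0.005 (0.005 * 20060 = 100.3, which ceil rounds up to 101, so the two are nearly equivalent).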
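The arithmetic can be checked with a small sketch. The helper name `effective_min_samples` is hypothetical and not part of scikit-learn; it only mirrors the ceil(value * n_samples) rule quoted from the documentation above.

```python
import math

n_samples = 20060  # row count from the question


def effective_min_samples(value, n_samples):
    """Hypothetical helper: turn an int or float setting into a sample count,
    using the ceil(value * n_samples) rule quoted from the docs."""
    if isinstance(value, float):
        return math.ceil(value * n_samples)
    return value


as_int = effective_min_samples(100, n_samples)
as_float = effective_min_samples(0.005, n_samples)
print(as_int)    # 100
print(as_float)  # 101, because 0.005 * 20060 = 100.3 is rounded up
```

Note that n_samples is the number of rows actually passed to fit, so after a train/test split or inside a cross-validation fold the same fraction maps to a smaller count.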

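Using floats lets you write values that do not depend on the absolute number of samples, so the same setting scales with the dataset. That is the advantage.

In any case, I would say you probably won't find big improvements, because the default values are very small.

This applies to both min_samples_split and min_samples_leaf.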
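As a rough sketch of how fractional values could go into the search dictionary: the ranges below are illustrative assumptions, not recommended values, and the data is a synthetic stand-in generated with make_regression rather than the asker's dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (assumption: 2000 rows, 10 features).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

# Candidate values expressed as fractions of n_samples.
# These ranges are only an example and should be adapted to the data.
params = {
    "min_samples_split": np.linspace(0.001, 0.05, 20),
    "min_samples_leaf": np.linspace(0.0005, 0.025, 20),
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    params,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Integer candidates, for example np.arange(2, 200, 10), work in the same dictionary; the float form just keeps the candidates proportional to however many rows each training fold contains.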

Answer 2

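It really depends on what you are trying to achieve. If your goal is a production-ready model, meaning one with an acceptable footprint, you may want to increase these values so that your trees are shallower on average. Otherwise, all your trees will have too many nodes and your model will take up too much memory at inference time. Another issue can be the time needed to train the model. For example, fitting an RF on about 4M samples can take around 1000 seconds with 'n_estimators': 50, 'max_depth': 200, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_jobs': 20, but only about 50 seconds with 'n_estimators': 50, 'max_depth': 200, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_jobs': 20, at similar accuracy.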
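The effect on model size can be seen on a much smaller scale with a single tree. This is only a sketch on synthetic data (an assumption, not the 4M-sample random forest described above); it compares the two pairs of settings from the example by node count and depth.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to illustrate the size difference.
X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)

settings = [
    {"min_samples_leaf": 1, "min_samples_split": 2},   # defaults: fully grown tree
    {"min_samples_leaf": 2, "min_samples_split": 10},  # slightly stricter
]

node_counts = []
for kwargs in settings:
    tree = DecisionTreeRegressor(random_state=0, **kwargs).fit(X, y)
    node_counts.append(tree.tree_.node_count)
    print(kwargs, "nodes:", tree.tree_.node_count, "depth:", tree.get_depth())
```

Fewer nodes means less memory per tree and less work during fitting, and in a forest that saving is multiplied by n_estimators.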

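If you only care about accuracy, you can set these values fairly low so as to get as many nodes as possible and learn extensively from your training data. Make sure to validate carefully (K-fold?) to avoid overfitting.

Tuning is an art; you need to strike a balance between practical applicability and the best metric score.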
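One way to do that validation is K-fold cross-validation with the training scores reported alongside. The sketch below uses synthetic data and the lowest possible settings (both are assumptions made for illustration); a large gap between training and validation R2 is the usual sign of overfitting.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data, used only for illustration.
X, y = make_regression(n_samples=2000, n_features=10, noise=25.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    DecisionTreeRegressor(min_samples_leaf=1, min_samples_split=2, random_state=0),
    X,
    y,
    cv=cv,
    scoring="r2",
    return_train_score=True,
)

train_r2 = results["train_score"].mean()
valid_r2 = results["test_score"].mean()
print(f"train R2: {train_r2:.2f}, validation R2: {valid_r2:.2f}")
```

Raising min_samples_leaf or min_samples_split and rerunning shows whether the gap narrows without the validation score dropping.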

Tags: python, decision-tree, python-3.x, scikit-learn