Why can't I find the lowest mean absolute error using Random Forest?

Question

I am working on a Kaggle competition using the following dataset: https://www.kaggle.com/c/home-data-for-ml-course/download/train.csv

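According to the theory, as the number of estimators in a Random Forest model increases, the mean absolute error should drop only up to a certain number (the sweet spot), and increasing it further should cause overfitting. Plotting the number of estimators against the mean absolute error, we should get this red graph, with the lowest point marking the best number of estimators.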

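I tried to find the best number of estimators with the following code, but the plot shows that the MAE keeps decreasing. What am I doing wrong?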

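import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('train.csv')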
y = train_data['SalePrice']
# For simplicity, drop all columns with missing values and all non-numeric columns
X = train_data.drop('SalePrice', axis=1).dropna(axis=1).select_dtypes(['number'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
mae_list = []
for n_estimators in range(10, 800, 10):
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=0, n_jobs=8)
    rf_model.fit(X_train, y_train)
    preds = rf_model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    mae_list.append({'n_est': n_estimators, 'mae': mae})

# Plot MAE against the number of estimators
plt.plot([item['n_est'] for item in mae_list], [item['mae'] for item in mae_list])
plt.xlabel('n_estimators')
plt.ylabel('MAE')
plt.show()

Answer 1

You are not necessarily doing anything wrong.

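Looking more closely at the theoretical curve you show, you will notice that the horizontal axis gives no indication of the actual number of trees/iterations at which such a minimum should occur. This is a fairly general feature of theoretical predictions of this kind: they tell you that something is expected, but not where exactly (or even roughly) you should expect it.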

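Keeping this in mind, the only thing I can conclude from your second plot is that, in the specific range of ~800 trees you have tried, you are still to the "left" of the expected minimum.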

Again, there is no theoretical prediction for how many trees (800 or 8,000 or ...) you should add before reaching that minimum.

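To bring some empirical evidence into the discussion: in my own first Kaggle competition, we kept adding trees until we reached ~24,000 before our validation error started diverging (we were using GBM and not RF, but the rationale is identical).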
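If you want to probe a wider range of tree counts without refitting the whole forest at every step, one option is scikit-learn's `warm_start` parameter, which keeps the trees already fitted and only trains the additional ones. The following is a minimal sketch, not a definitive recipe. The synthetic data from `make_regression`, the range of 50 to 500 trees and the `best` variable are assumptions made for illustration; with the real data you would use the `X` and `y` built from `train.csv` and a much larger upper bound.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with X, y built from train.csv.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# warm_start=True reuses the trees from the previous fit and only adds new ones.
rf_model = RandomForestRegressor(warm_start=True, random_state=0, n_jobs=-1)

mae_list = []
for n_estimators in range(50, 501, 50):
    # Raise the target forest size; fit() then trains only the missing trees.
    rf_model.set_params(n_estimators=n_estimators)
    rf_model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, rf_model.predict(X_test))
    mae_list.append({'n_est': n_estimators, 'mae': mae})

# Entry with the lowest validation MAE within the range that was tried.
best = min(mae_list, key=lambda item: item['mae'])
print(best)
```

Note that `best` is only the minimum inside the range you searched. If it falls at the largest value tried, you are most likely still to the left of any minimum and would need to extend the range.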


python

machine-learning

random-forest

scikit-learn

kaggle