Why does XGBoost with datasets of zeros return a non-zero prediction?

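I recently built a fully functional random forest regression program using the scikit-learn RandomForestRegressor model, and now I would like to compare its performance with other libraries. I found the scikit-learn API for XGBoost random forest regression and ran a small test in which both the X features and the Y targets are all zeros.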

from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor


tree_number = 100
depth = 10
jobs = 1
dimension = 19
sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                               n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                         n_jobs=jobs)
dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)
sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))

Surprisingly, the xgb_VAL model returns a non-zero prediction for an all-zero input sample:

sk_prediction = [0.]
xgb_prediction = [0.02500369]

Is there a mistake in my evaluation, or in how I set up the comparison?

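XGBoost appears to include a global bias in the model, and this bias is fixed at 0.5 rather than computed from the input data. This has been raised as an issue in the XGBoost GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, your model will predict zero as expected.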

from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42, n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, base_score=0, random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))

print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
# sk_prediction = [0.]
# xgb_prediction = [0.]
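
To make the role of the bias concrete, here is a toy sketch of an additive model whose prediction is a global bias plus the sum of the tree outputs. It is only an illustration, not XGBoost's actual internals: the function name additive_prediction and the tree output values are hypothetical, chosen so that the trees cancel most, but not all, of a 0.5 bias.

```python
def additive_prediction(base_score, tree_outputs):
    """Toy additive model: a global bias plus the sum of per-tree outputs."""
    return base_score + sum(tree_outputs)


# Hypothetical case: the trees only partly cancel a fixed bias of 0.5,
# so a small non-zero remainder is left in the prediction.
with_bias = additive_prediction(0.5, [-0.475])

# With a zero bias and all-zero targets there is nothing for the trees
# to correct, so (in this toy model) every tree contributes zero.
without_bias = additive_prediction(0.0, [0.0])

print("with bias: {}\nwithout bias: {}".format(with_bias, without_bias))
```

In this picture, setting base_score=0 removes the only non-zero term, which matches the behaviour shown above.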