XGBoost giving slightly different predictions for list vs array, which is correct?

Question

I noticed that I was passing a double-bracketed list of test feature values:

print(test_feats)
>> [[23.0, 3.0, 35.0, 0.28, -3.0, 18.0, 0.0, 0.0, 0.0, 3.33, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 39.0, 36.0, 113.0, 76.0, 0.0, 0.0, 1.0, 0.34, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, 0.0, 25.0, 48.0, 48.0, 0.0, 29.0, 52.0, 53.0, 99.0, 368.0, 676.0, 691.0, 4.0, 9.0, 12.0, 13.0]]

When I pass this list to XGBoost for prediction, it returns a different result than when I convert it to an array first:

array_test_feats = np.array(test_feats)
print(regr.predict_proba(test_feats)[:,1][0])
print(regr.predict_proba(array_test_feats)[:,1][0])
>> 0.46929297
>> 0.5161868

Some basic checks show that the values are the same:

print(sum(test_feats[0]) == array_test_feats.sum())
print(test_feats == array_test_feats)
>> True
>> array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True]])

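I'm guessing the array is the way to go, but I really don't know how to tell. The two predictions are close enough that the difference is easy to miss, so I'd really like to know why this happens.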

Answer 1

You have just run into the issue described here: https://github.com/dmlc/xgboost/pull/3970

The documentation does not include lists as an allowed type for the data inputted into DMatrix. Despite this, a list can be passed in without an error. This change would prevent a list form being passed in directly.

I experienced an issue where passing in a list vs a np.array resulted in different predictions (sometimes over 10% relative difference) for the same data. Though these differences were infrequent (~1.5% of cases tested), in certain applications this could cause serious issues.

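Essentially, what happens behind the scenes is that passing a Python list directly is not officially supported in XGBoost, but it works anyway because it hits a fall-through case in XGBoost's data conversion.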
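
One way to stay on the supported path is to convert any list to a NumPy array before it reaches the model. The helper below is only a sketch: `as_dense_features` is a hypothetical name, the `float32` dtype is an assumption, and the sample row is a shortened stand-in for the features in the question.

```python
import numpy as np

def as_dense_features(features):
    """Return a 2-D NumPy array for any list-like input (hypothetical helper)."""
    # float32 is an assumption here, not a requirement stated in the answer.
    arr = np.asarray(features, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single sample becomes one row
    if arr.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D features, got {arr.ndim}-D")
    return arr

test_feats = [[23.0, 3.0, 0.0, 0.28]]  # shortened, illustrative sample
dense = as_dense_features(test_feats)
print(type(dense).__name__, dense.shape)
# regr.predict_proba(dense)  # `regr` is the fitted model from the question
```

Routing every call through one helper like this means the same input type is used at training and prediction time, which is the mismatch the question ran into.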

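As a result of that fall-through, XGBoost uses the XGDMatrixCreateFromCSREx function instead of XGDMatrixCreateFromMat to create the underlying matrix for the data. The sparse and dense representations then differ in how they treat missing elements: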

"Sparse" elements are treated as "missing" by the tree booster and as zeros by the linear booster.
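
The effect of that rule can be illustrated with a toy model in plain Python. This is not XGBoost's implementation: the CSR-style conversion, the single-node "tree", the threshold, and the default direction are all made up for illustration. It only shows how a stored `0.0` (dense) and an absent entry (sparse) can take different branches at the same split.

```python
import math

def to_csr_row(row):
    """Keep only the non-zero entries, as a CSR-style (indices, values) pair."""
    indices = [i for i, v in enumerate(row) if v != 0.0]
    values = [row[i] for i in indices]
    return indices, values

def lookup_dense(row, feature):
    # Dense input: every position holds a value; only NaN counts as missing.
    v = row[feature]
    return None if math.isnan(v) else v

def lookup_sparse(csr_row, feature):
    # Sparse input: an entry that was never stored is reported as missing.
    indices, values = csr_row
    return values[indices.index(feature)] if feature in indices else None

def toy_split(value, threshold, default_left):
    """One tree node: missing values follow the node's default direction."""
    if value is None:
        return "left" if default_left else "right"
    return "left" if value < threshold else "right"

row = [23.0, 3.0, 0.0, 0.28]
feature, threshold, default_left = 2, 0.5, False  # hypothetical split

dense_branch = toy_split(lookup_dense(row, feature), threshold, default_left)
sparse_branch = toy_split(lookup_sparse(to_csr_row(row), feature), threshold, default_left)
print(dense_branch, sparse_branch)  # left right
```

The feature vector in the question contains many `0.0` entries, so under this reading any split on one of those features may send the list input and the array input down different branches, which would account for the small difference in predicted probability.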
