XGBoost gives slightly different predictions for a list vs. an array. Which is correct?
I noticed that I was passing a double-bracketed list of test feature values:
print(test_feats)
>> [[23.0, 3.0, 35.0, 0.28, -3.0, 18.0, 0.0, 0.0, 0.0, 3.33, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 39.0, 36.0, 113.0, 76.0, 0.0, 0.0, 1.0, 0.34, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, 0.0, 25.0, 48.0, 48.0, 0.0, 29.0, 52.0, 53.0, 99.0, 368.0, 676.0, 691.0, 4.0, 9.0, 12.0, 13.0]]
When I pass this list to XGBoost for prediction, it returns a different result than when I first convert it to an array:
import numpy as np
array_test_feats = np.array(test_feats)
print(regr.predict_proba(test_feats)[:,1][0])
print(regr.predict_proba(array_test_feats)[:,1][0])
>> 0.46929297
>> 0.5161868
Some basic checks show that the values are identical:
print(sum(test_feats[0]) == array_test_feats.sum())
print(test_feats == array_test_feats)
>> True
>> array([[ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True]])
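I'm guessing the array is the way to go, but I really don't know how to tell. The predictions are close enough that the difference would be easy to miss, so I'd really like to know why this happens.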
You have just run into the issue described here: https://github.com/dmlc/xgboost/pull/3970
The documentation does not include lists as an allowed type for the
data inputted into DMatrix. Despite this, a list can be passed in
without an error. This change would prevent a list form being passed
in directly.
I experienced an issue where passing in a list vs a np.array resulted
in different predictions (sometimes over 10% relative difference) for
the same data. Though these differences were infrequent (~1.5% of
cases tested), in certain applications this could cause serious
issues.
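Until a list is rejected outright, one defensive option is to convert the input yourself before predicting. The following is only a minimal sketch: `as_dense_features` is a hypothetical helper (not part of XGBoost), the choice of `float32` is an assumption, and the feature row is shortened for illustration.

```python
import numpy as np

def as_dense_features(features):
    """Return `features` as a 2-D NumPy array.

    Hypothetical helper, not part of XGBoost. It only guarantees that
    whatever is handed to the model is a dense ndarray, never a list.
    """
    # Assumption: float32 is used here; adjust if your pipeline needs otherwise.
    arr = np.asarray(features, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single sample becomes one row
    if arr.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D features, got {arr.ndim}-D")
    return arr

# Shortened, illustrative row (the real one has 53 values).
test_feats = [[23.0, 3.0, 0.0, 0.28, -999.0]]
dense = as_dense_features(test_feats)
print(type(dense).__name__, dense.shape)
```

With a fitted model such as the `regr` from the question, the call would then be `regr.predict_proba(as_dense_features(test_feats))`, so the model always receives an array.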
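Essentially, what happens behind the scenes is that passing a Python list directly is not officially supported in XGBoost, but it works anyway because it hits a fall-through case in XGBoost's data conversion.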
This causes XGBoost to use the XGDMatrixCreateFromCSREx function instead of XGDMatrixCreateFromMat to create the underlying matrix for the data. The sparse and dense representations then differ in how they treat missing elements:
"Sparse" elements are treated as "missing" by the tree booster and as
zeros by the linear booster.
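To see why the sparse path can change the result, it helps to look at what a CSR encoding actually stores. The sketch below builds one by hand with NumPy. It is an illustration only, not XGBoost's own conversion code: `to_csr` is a hypothetical function and the row is a shortened, made-up example.

```python
import numpy as np

def to_csr(dense):
    """Encode a 2-D array in CSR form by hand: (data, indices, indptr).

    Illustrative only. Non-zero entries are stored explicitly; zeros are
    simply left out and become implicit.
    """
    dense = np.asarray(dense, dtype=np.float32)
    data, indices, indptr = [], [], [0]
    for row in dense:
        cols = np.flatnonzero(row)        # column positions of non-zero values
        indices.extend(cols.tolist())
        data.extend(row[cols].tolist())
        indptr.append(len(indices))       # end offset of this row's entries
    return np.array(data, dtype=np.float32), np.array(indices), np.array(indptr)

# Shortened, made-up row mixing real zeros with other values.
row = [[23.0, 0.0, 35.0, 0.0, -999.0, 1.0]]
data, indices, indptr = to_csr(row)

stored = len(data)                        # explicitly stored entries
implicit = np.asarray(row).size - stored  # zeros that were dropped
print(stored, implicit)
```

The two zeros in this row are no longer present in the stored data. The row in the question contains many 0.0 entries, so on the sparse path the tree booster would see them as missing rather than as zeros, while the dense array keeps them as real values. That suggests the array prediction is the one to trust.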