Numerical jump in sklearn GradientBoostingRegressor
I have been working on a "hand-rolled" version of a gradient boosted regression tree. I find that my errors agree very well with the sklearn GradientBoostingRegressor module until I increase the tree-building loop above a certain value. I am not sure whether this is a bug in my code or a feature of the algorithm itself, so I am looking for some guidance on what might be going on. The full code listing, using the Boston housing data, is shown below, followed by the output as I change the loop parameter.
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)

alpha = 0.5
loop = 44
yhi_1 = 0
ypT = 0

for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    ri = y_train - yhi_1              # residuals of the current ensemble
    dt.fit(X_train, ri)
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi          # updated training-set prediction
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi                   # accumulated test-set prediction
    yhi_1 = yhi

r2Loop = metrics.r2_score(y_test, ypT)
print("dtL: R^2 = ", r2Loop)

from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop, learning_rate=alpha, random_state=42, init="zero")
gbrt.fit(X_train, y_train)
gbrt.loss
y_pred = gbrt.predict(X_test)
r2GBRT = metrics.r2_score(y_test, y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop, learning_rate=alpha,random_state=42,init="zero")
gbrt.fit(X_train,y_train)
gbrt.loss
y_pred = gbrt.predict(X_test)
r2GBRT= metrics.r2_score(y_test,y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
With the parameter loop = 44 the output is
dtL: R^2 = 0.8702681499951852
GBT: R^2 = 0.8702681499951852
R2loop - GBT: 0.0
and the two agree. If I increase the loop parameter to loop = 45
I get
dtL: R^2 = 0.8726215419913225
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 0.0005993263531949289
a sudden jump in the discrepancy between the two algorithms, from agreement at 15 to 16 decimal places to a difference in the fourth. Any thoughts?
I think there are two sources of discrepancy here. The biggest one is the randomness in the DecisionTreeRegressor.fit method. Even though you set random_state=42 on both the GradientBoostingRegressor and every DecisionTreeRegressor, your DecisionTreeRegressor training loop does not replicate how GradientBoostingRegressor handles the random seed. In your loop, you reset the seed on every iteration; in the GradientBoostingRegressor.fit method, the seed is (I assume) set only once, at the start of training. I have modified your code as follows:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
import numpy as np

X, y = load_boston(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)

alpha = 0.5
loop = 45
yhi_1 = 0
ypT = 0

np.random.seed(42)                    # seed once, before the whole loop
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2)
    ri = y_train - yhi_1              # residuals of the current ensemble
    dt.fit(X_train, ri)
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi          # updated training-set prediction
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi                   # accumulated test-set prediction
    yhi_1 = yhi

r2Loop = metrics.r2_score(y_test, ypT)
print("dtL: R^2 = ", r2Loop)

np.random.seed(42)                    # seed once, before fitting the ensemble
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop, learning_rate=alpha, init="zero")
gbrt.fit(X_train, y_train)
gbrt.loss
y_pred = gbrt.predict(X_test)
r2GBRT = metrics.r2_score(y_test, y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
The only difference is in how I set the random seed: I now use numpy to seed the generator once before each training run. With this change, and with loop = 45, I get the following output:
dtL: R^2 = 0.8720222156381277
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 1.1102230246251565e-16
which is down at the level of floating-point error (the other source of discrepancy I mentioned in my first sentence), and for many values of loop I see no difference at all.
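The residual 1.1102230246251565e-16 is exactly what floating-point non-associativity produces: the hand-rolled loop accumulates the scaled tree predictions one at a time, while GradientBoostingRegressor may group the same additions differently, and double-precision addition is not associative. A minimal sketch of the effect, unrelated to sklearn:

```python
# Double-precision addition is not associative: regrouping the same
# three terms changes the result by one ulp of 0.6.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

print(left == right)   # False
print(left - right)    # 1.1102230246251565e-16
```

The difference here is one unit in the last place of 0.6 (2**-53), the same magnitude as the R^2 discrepancy in the output above, which is why it is harmless.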