Boosted regression trees - deviance values
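I am using the gbm package in R to fit a BRT model for the following formula:
Height above ground ~ age + season + habitat + time
Height above ground is a continuous variable, and so is time. Season and habitat are binary variables.
My deviance is extremely high and I don't know why...
Can somebody help me with setting the parameters?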
> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+ family = "gaussian", tree.complexity = 4,
+ learning.rate = 0.01, bag.fraction = 0.50,
+ tolerance.method = "fixed",
+ tolerance = 0.01)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for HAG and using a family of gaussian
Using 15439 observations and 4 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 55368.22
tolerance is fixed at 0.01
ntrees resid. dev.
50 51050.65
now adding trees...
100 48935.65
150 47805.14
200 47193.43
250 46841.71
300 46631.33
350 46498.56
400 46418.58
450 46371.7
500 46336.54
550 46317.53
600 46309.25
650 46300.57
700 46296.82
750 46297
800 46299.11
850 46297.7
900 46298.34
950 46292.32
1000 46297.62
1050 46295.78
1100 46301.32
1150 46306.59
1200 46312.55
1250 46314.67
1300 46318.64
1350 46321.38
1400 46324.33
1450 46322.9
fitting final gbm model with a fixed number of 950 trees for HAG
mean total deviance = 55368.21
mean residual deviance = 45913.34
estimated cv deviance = 46292.32 ; se = 1366.501
training data correlation = 0.413
cv correlation = 0.406 ; se = 0.008
elapsed time - 0.02 minutes
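For a gaussian family, the deviance reported by gbm is the mean squared error, so its size depends on the scale of the dependent variable.
For example: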
library(dismo)
library(mlbench)
data(BostonHousing)
idx=sample(nrow(BostonHousing),400)
TrnData = BostonHousing[idx,]
TestData = BostonHousing[-idx,]
The dependent variable is the last column, "medv", so we run gbm.step on the original data:
gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 84.02
mean residual deviance = 7.871
estimated cv deviance = 13.959 ; se = 1.909
training data correlation = 0.952
cv correlation = 0.916 ; se = 0.012
You can see that the mean residual deviance can also be calculated from the residuals (i.e. y minus predicted y):
mean(gbm_0$residuals^2)
[1] 7.871158
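To make the link between deviance and squared error concrete, here is a minimal base R sketch. It is only an illustration under stated assumptions: `lm()` on the built-in `mtcars` data stands in for `gbm.step`, and the names `null_mse` and `resid_mse` are made up for this example. The idea is that, for a gaussian fit, the "mean total deviance" should correspond to the mean squared error of predicting the mean of y for every row, and the "mean residual deviance" to the mean squared error of the fitted values.

```r
# Stand-in model: lm() on mtcars replaces gbm.step (assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg

# Analogue of "mean total deviance": MSE of an intercept-only prediction
null_mse <- mean((y - mean(y))^2)

# Analogue of "mean residual deviance": MSE of the fitted model
resid_mse <- mean((y - fitted(fit))^2)

# Same number as squaring the stored residuals, as done with gbm_0$residuals
same_as_residuals <- isTRUE(all.equal(resid_mse, mean(residuals(fit)^2)))

print(c(null_mse = null_mse, resid_mse = resid_mse))
print(same_as_residuals)
```

Both quantities are in squared units of the response, which is why a response measured in large units produces large deviance values even for a reasonable model.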
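It is always good to evaluate on test data (data the model has not been trained on). You can also use the correlation or the MAE (mean absolute error) to check how close the predictions are to the actual values: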
pred = predict(gbm_0,TestData,1000)
# or pearson if you like
cor(pred,TestData$medv,method="spearman")
[1] 0.8652737
# MAE
mean(abs(TestData$medv-pred))
[1] 2.75325
So here the correlation is good, and the predictions are off by about 3 on average.
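Now, if you change the scale of your dependent variable, the deviance changes, but the interpretation you get from the correlation or the MAE stays the same (the correlation is essentially unchanged, and the MAE simply scales with the response):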
TrnData$medv = TrnData$medv*2
TestData$medv = TestData$medv*2
gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 336.081
mean residual deviance = 30.983
estimated cv deviance = 57.52 ; se = 10.254
training data correlation = 0.953
cv correlation = 0.911 ; se = 0.019
elapsed time - 0.2 minutes
pred = predict(gbm_2,TestData,1000)
cor(pred,TestData$medv,method="spearman")
[1] 0.8676821
mean(abs(TestData$medv-pred))
[1] 5.47673
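As a rough, scale-free way to compare the three runs, the deviance values printed above can be turned into a fraction of deviance explained. The sketch below only does arithmetic on numbers copied from the printouts; the data frame `runs` and its column names are hypothetical names chosen for this example, not something gbm.step returns.

```r
# Values copied from the gbm.step printouts above
runs <- data.frame(
  model = c("HAG", "medv", "medv * 2"),
  total = c(55368.21, 84.02, 336.081),  # mean total deviance
  cv    = c(46292.32, 13.959, 57.52)    # estimated cv deviance
)

# Fraction of total deviance explained under cross-validation (unitless)
runs$cv_explained <- 1 - runs$cv / runs$total

# Square root of the cv deviance: an error on the scale of the response
runs$cv_rmse <- sqrt(runs$cv)

print(runs, digits = 3)

# Doubling medv multiplies the total deviance by about 2^2 = 4
scale_ratio <- runs$total[3] / runs$total[2]
print(scale_ratio)
```

On these numbers the HAG model explains roughly 16% of the deviance under cross-validation, which is in line with its cv correlation of 0.406, while both medv runs explain roughly 83%, whatever the scale. Read this way, the large deviance in the question mostly reflects the units of HAG, and the low fraction explained suggests that the four predictors account for only a small part of the variation, rather than pointing to a problem with the parameter settings.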