Boosting classification trees in R

Question

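I'm trying to boost a classification tree using the gbm package in R, and I'm a little confused about the kind of predictions I get from the predict function.

Here is my code: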

  #Load packages, set random seed
  library(gbm)
  set.seed(1)

  #Generate random data
  N<-1000
  x<-rnorm(N)
  y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
  z<-rep(0,N)
  for(i in 1:N){
    if(x[i]-y[i]+0.2*rnorm(1)>1.0){
      z[i]=1
    }
  }

  #Create data frame
  myData<-data.frame(x,y,z)

  #Split data set into train and test
  train<-sample(N,800,replace=FALSE)
  test<-(-train)

  #Boosting
  boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
  pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
  pred.boost

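pred.boost is a vector whose elements lie in the interval (0,1).

I expected the predictions to be either 0 or 1, since my response variable z also takes only the binary values 0 and 1 and I am using distribution="bernoulli".

How should I proceed with the predictions to get a real classification of my test data set? Should I simply round the pred.boost values, or am I doing something wrong with the predict function?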

Answer 1

The behavior you observe is correct. From the documentation:

If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli.

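So getting probabilities when you use type="response" is correct. Specifying distribution="bernoulli" only tells gbm that the labels follow a Bernoulli (0/1) pattern. You can omit it and the model will still run fine, because gbm then guesses the distribution and assumes bernoulli when the response has only two unique values.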
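
To make the two prediction scales concrete, here is a minimal sketch that fits a small bernoulli model and compares type="link" (log-odds) with type="response" (probabilities). The data set `d`, the tree count, and the object names are made up for illustration and are not the question's data; it assumes the gbm package is installed.

```r
library(gbm)
set.seed(1)

# Small synthetic data set (illustrative only, not the question's data)
n <- 500
d <- data.frame(x = rnorm(n), y = rnorm(n))
d$z <- as.integer(d$x - d$y + 0.2 * rnorm(n) > 1)

fit <- gbm(z ~ ., data = d, distribution = "bernoulli",
           n.trees = 200, interaction.depth = 2)

# Predictions on the link scale (log-odds) and on the response scale
link <- predict(fit, newdata = d, n.trees = 200, type = "link")
prob <- predict(fit, newdata = d, n.trees = 200, type = "response")

# For bernoulli, the response scale is the inverse logit of the link scale
print(all.equal(plogis(link), prob))
print(range(prob))
```

Neither scale is a class label; both still need a cutoff to become a 0/1 prediction.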

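To get class labels, threshold the probabilities, for example predict_class <- pred.boost > 0.5 (cutoff = 0.5; wrap it in as.integer() if you need 0/1 rather than TRUE/FALSE), or plot the ROC curve and choose the cutoff yourself.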
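
Both options can be sketched in base R. In the snippet below, `truth` and `prob` are hypothetical stand-ins for myData$z[test] and pred.boost, simulated so the block runs on its own. Picking the cutoff that maximizes TPR minus FPR (Youden's J) is one common criterion, not something the answer prescribes.

```r
set.seed(1)

# Hypothetical stand-ins: `truth` for myData$z[test], `prob` for pred.boost
truth <- rbinom(200, 1, 0.3)
prob  <- plogis(rnorm(200, mean = ifelse(truth == 1, 1, -1)))

# Option 1: fixed cutoff of 0.5
pred_class <- as.integer(prob > 0.5)
conf_mat   <- table(predicted = pred_class, actual = truth)
accuracy   <- mean(pred_class == truth)
print(conf_mat)
print(accuracy)

# Option 2: trace the ROC curve over candidate cutoffs
cutoffs <- c(0, sort(unique(prob)))
tpr <- sapply(cutoffs, function(k) mean(prob[truth == 1] > k))  # true positive rate
fpr <- sapply(cutoffs, function(k) mean(prob[truth == 0] > k))  # false positive rate
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)  # reference line for a random classifier

# One possible rule: the cutoff that maximizes TPR - FPR (Youden's J)
best_cutoff <- cutoffs[which.max(tpr - fpr)]
print(best_cutoff)
```

With the question's objects, replace `truth` with myData$z[test] and `prob` with pred.boost. Ideally the cutoff is chosen on a validation split rather than on the test set itself.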

Answer 2

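Try adabag. Class, probabilities, votes, and error are all built into adabag, which makes the results easy to interpret and, of course, takes fewer lines of code.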
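
A rough sketch of what that might look like, assuming the adabag package is installed. It regenerates data in the spirit of the question (vectorized, so not the identical random draws) and converts z to a factor, which adabag's boosting() expects for the response. The mfinal value and the object names are arbitrary choices for illustration.

```r
library(adabag)
set.seed(1)

# Data in the spirit of the question (vectorized, so draws differ from the original)
N <- 1000
x <- rnorm(N)
y <- 0.6^2 * x + sqrt(1 - 0.6^2) * rnorm(N)
z <- factor(as.integer(x - y + 0.2 * rnorm(N) > 1.0))  # response must be a factor
myData <- data.frame(x, y, z)

train <- sample(N, 800, replace = FALSE)

# AdaBoost with 50 trees (mfinal is an arbitrary choice here)
fit  <- boosting(z ~ ., data = myData[train, ], mfinal = 50)
pred <- predict(fit, newdata = myData[-train, ])

print(head(pred$class))   # predicted class labels
print(head(pred$prob))    # class probabilities
print(pred$confusion)     # confusion matrix on the test rows
print(pred$error)         # test error rate
```

Note that this is a different boosting algorithm from gbm's gradient boosting, so the fitted models are not directly interchangeable.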


r

classification

boosting