R: why does gbm give NA values on Titanic data?

I have the classic Titanic data set. Here is the structure of the cleaned data.

> str(titanic)
'data.frame':   887 obs. of  7 variables:
 $ Survived               : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
 $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...

First I split the data:

set.seed(123)
# smp_size is the training-set size; its definition is not shown here
train_ind <- sample(seq_len(nrow(titanic)), size = smp_size)
train <- titanic[train_ind, ]
test <- titanic[-train_ind, ]

Then I recode the Survived column as 0 and 1.

train$Survived <- as.factor(ifelse(train$Survived == 'Yes', 1, 0))
test$Survived <- as.factor(ifelse(test$Survived == 'Yes', 1, 0))

Finally, I run the gradient boosting algorithm.

dt_gb <- gbm(Survived ~ ., data = train)

This is the result.

> print(dt_gb)
gbm(formula = Survived ~ ., data = train)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 0 had non-zero influence.

Because 0 predictors have non-zero influence, the predictions are NA. Why does this happen? Is there something wrong with my code?

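Avoid converting Survived to a 0/1 factor in the training and test data. Instead, make the Survived column a numeric 0/1 vector, which is what gbm's bernoulli loss expects.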

# e.g. like this: the factor levels "No"/"Yes" have internal codes 1/2,
# so subtracting 1 gives 0/1
titanic$Survived <- as.numeric(titanic$Survived) - 1

# data should look like this
> str(titanic)
'data.frame':   887 obs. of  7 variables:
$ Survived               : num  0 1 1 1 0 0 0 0 1 1 ...
$ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
$ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
$ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
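
To see why the conversion matters, here is a small base-R sketch on a made-up factor (the vector `surv` is hypothetical, not the real data). It shows that `as.numeric()` on a factor returns the internal level codes, which is why subtracting 1 gives 0/1 for a two-level factor, and why a factor labelled "0"/"1" is not the same thing as a numeric 0/1 vector.

```r
# Hypothetical toy factor standing in for titanic$Survived
surv <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# as.numeric() returns the internal level codes (1 for "No", 2 for "Yes"),
# so subtracting 1 yields a numeric 0/1 response.
y <- as.numeric(surv) - 1
print(y)

# A factor whose labels are "0"/"1" still has codes 1/2 underneath.
f01 <- as.factor(ifelse(surv == "Yes", 1, 0))
codes <- as.numeric(f01)
print(codes)

# To recover the labels as numbers, go through character first.
y2 <- as.numeric(as.character(f01))
stopifnot(identical(y, y2))
```

If the column has already been turned into a "0"/"1" factor, as in the question, the `as.numeric(as.character(...))` route is the one that gets you back to numeric 0/1.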

Then fit the model with the Bernoulli loss.

dt_gb <- gbm::gbm(formula = Survived ~ ., data = titanic, 
                  distribution = "bernoulli")

> print(dt_gb)
gbm::gbm(formula = Survived ~ ., distribution = "bernoulli", 
    data = titanic)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 6 had non-zero influence.

Get the predicted survival probabilities for the first few passengers:

> head(predict(dt_gb, type = "response"))
[1] 0.1200703 0.9024225 0.5875393 0.9271306 0.1200703 0.1200703
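
If you want to keep a train/test split like the one in the question, the same recoding applies to both halves. Below is a rough sketch, assuming the `gbm` package is installed. The data frame `toy` is a made-up stand-in for the Titanic data, the 75% split size is an assumption (the question does not show `smp_size`), and `n.trees = 100` simply matches the default number of iterations printed above.

```r
library(gbm)  # assumes the gbm package is installed

set.seed(123)

# Hypothetical stand-in for the Titanic data (not the real data set)
n <- 400
toy <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = round(runif(n, 1, 70)),
  Fare = round(rexp(n, 1 / 30), 2)
)
p <- plogis(1.5 - 2.5 * (toy$Sex == "male") - 0.01 * toy$Age)
toy$Survived <- factor(ifelse(runif(n) < p, "Yes", "No"),
                       levels = c("No", "Yes"))

# Split into train and test (75% is an assumed fraction)
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_ind, ]
test  <- toy[-train_ind, ]

# Recode the response as numeric 0/1 in both halves, not as a factor
train$Survived <- as.numeric(train$Survived == "Yes")
test$Survived  <- as.numeric(test$Survived == "Yes")

fit <- gbm(Survived ~ ., data = train, distribution = "bernoulli")

# Predicted survival probabilities for the held-out passengers
p_test <- predict(fit, newdata = test, n.trees = 100, type = "response")

# Confusion table at a 0.5 cutoff (the cutoff is an arbitrary choice)
print(table(predicted = as.integer(p_test > 0.5), actual = test$Survived))
```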