Use pROC package with h2o
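I am using the h2o package to fit a GBM for binary classification. I want to assess the predictive power of a particular variable, and if I understand correctly, I can do this by comparing the AUC of a model that includes the variable with the AUC of a model that excludes it.
I am using the Titanic dataset as an example.
So my hypothesis is:
Age has significant predictive value for whether a person survives.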
library(h2o)
h2o.init()
df <- h2o.importFile(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
response <- "survived"
df[[response]] <- as.factor(df[[response]])
## use all other columns (except for the name) as predictors
predictorsA <- setdiff(names(df), c(response, "name"))
predictorsB <- setdiff(names(df), c(response, "name", "age"))
splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.6, 0.2),  ## only need to specify 2 fractions, the 3rd is implied
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]
gbmA <- h2o.gbm(x = predictorsA, y = response, distribution="bernoulli", training_frame = train)
gbmB <- h2o.gbm(x = predictorsB, y = response, distribution="bernoulli", training_frame = train)
## Get the AUC
h2o.auc(h2o.performance(gbmA, newdata = valid))
[1] 0.9631624
h2o.auc(h2o.performance(gbmB, newdata = test))
[1] 0.9603211
I know the pROC package has a roc.test function for comparing the AUC of two ROC curves, and I would like to apply it to the output of my h2o models.
You can do it like this:
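library(pROC)
## score the same validation frame with both models
valid_A <- as.data.frame(h2o.predict(gbmA, valid))
valid_B <- as.data.frame(h2o.predict(gbmB, valid))
valid_df <- as.data.frame(valid)
## p1 is the predicted probability of survived == 1
roc1 <- roc(valid_df$survived, valid_A$p1)
roc2 <- roc(valid_df$survived, valid_B$p1)
roc.test(roc1, roc2)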
DeLong's test for two correlated ROC curves
data: roc1 and roc2
Z = -0.087489, p-value = 0.9303
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9500141 0.9504367
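If you want to try the same comparison without an h2o cluster, here is a minimal sketch on synthetic data. The names truth, score_a and score_b are made up for illustration and are not the Titanic results above. It assumes a recent pROC version (for the quiet argument) and sets levels, direction and paired explicitly rather than relying on auto-detection.

```r
# Sketch only: paired DeLong test on synthetic scores, not on the h2o models above.
library(pROC)

set.seed(42)
n <- 300

# Hypothetical binary outcome and two correlated score vectors
truth   <- rbinom(n, 1, 0.4)
score_a <- truth + rnorm(n)               # stands in for the "full model" scores
score_b <- score_a + rnorm(n, sd = 0.5)   # noisier scores, stands in for the "reduced model"

# Fix levels and direction so both curves are built the same way
roc_a <- roc(truth, score_a, levels = c(0, 1), direction = "<", quiet = TRUE)
roc_b <- roc(truth, score_b, levels = c(0, 1), direction = "<", quiet = TRUE)

# Both curves come from the same observations, so the test is paired
res <- roc.test(roc_a, roc_b, method = "delong", paired = TRUE)
print(res)

# The result is an "htest" object, so the p-value can be extracted directly
res$p.value
```

With the h2o models, the equivalent inputs would be valid_df$survived as the response and the p1 columns as the predictors, both taken from the same frame so that the paired test is valid.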