Error while estimating xgboost in h2o after update to 3.18

Question

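I ran into the known issue of not being able to save an xgboost model and load it later to get predictions, which was reportedly fixed in h2o 3.18 (the problem was present in 3.16). I updated the package from h2o's website (the downloadable zip), and now a model that previously ran without problems gives the following error: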

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix,  : 
  Unexpected CURL error: Failed to connect to localhost port 54321: Connection refused

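This happens only with xgboost (binary classification); the other models I use work fine. h2o is of course initialized, and the models estimated before the xgboost one run without problems. Does anyone know what the problem might be?

Edit: here is a reproducible example (based on Erin's answer) that produces the error: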

library(h2o)
library(caret)   # createFolds()
library(dplyr)   # %>% and mutate(), used in version 1 below
h2o.init()

# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# Assigning fold column
set.seed(1)
cv_folds <- createFolds(as.data.frame(train)$response,
                        k = 5,
                        list = FALSE,
                        returnTrain = FALSE)

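# Run only ONE of the two versions below; both add the same fold column.

# version 1: round-trip through an R data.frame (needs dplyr for %>% and mutate)
train <- train %>%
    as.data.frame() %>% 
    mutate(fold_assignment = cv_folds) %>%
    as.h2o()

# version 2: bind the fold vector directly to the H2OFrame
# train <- h2o.cbind(train, as.h2o(cv_folds))
# names(train)[dim(train)[2]] <- "fold_assignment"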


# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])

xgb <- h2o.xgboost(x = x,
                   y = y, 
                   seed = 1,
                   training_frame = train,
                   fold_column = "fold_assignment",
                   keep_cross_validation_predictions = TRUE,
                   eta = 0.01,
                   max_depth = 3,
                   sample_rate = 0.8,
                   col_sample_rate = 0.6,
                   ntrees = 500,
                   reg_lambda = 0,
                   reg_alpha = 1000,
                   distribution = 'bernoulli')

Both versions of creating the training data.frame lead to the same error.

Answer 1

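You didn't say whether you re-trained the models using 3.18. In general, H2O only guarantees model compatibility within the same major version of H2O. If you did not re-train the models, that is probably why XGBoost is not working. If you did re-train the models on 3.18 and XGBoost still does not work, please post a reproducible example and we will look into it further.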
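A quick way to check whether a saved model and the running cluster come from the same release is to compare their version strings. This is only a minimal sketch: `same_release_line` is a hypothetical helper, not part of the h2o package, and it assumes that the "major version" is identified by the first two components of the version string (e.g. `3.18` in `3.18.0.2`).

```r
# Hypothetical helper (not an h2o function): TRUE when two H2O version
# strings belong to the same release line.
# Assumption: the release line is the first two dot-separated components.
same_release_line <- function(saved_version, running_version) {
  release_line <- function(v) {
    parts <- strsplit(v, ".", fixed = TRUE)[[1]]
    paste(parts[1:2], collapse = ".")
  }
  release_line(saved_version) == release_line(running_version)
}

# A model saved under 3.16 versus a cluster running 3.18: re-train.
same_release_line("3.16.0.2", "3.18.0.2")  # FALSE
# Two builds of the same release line.
same_release_line("3.18.0.1", "3.18.0.2")  # TRUE
```

With a running cluster, the second argument could come from `h2o.getVersion()`; the version the model was saved under has to be recorded by you at save time.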

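Edit: I'm adding a reproducible example (the only difference between your code and this code is that I'm not using fold_column here). This runs fine on 3.18.0.2. Without a reproducible example that produces an error, I can't help you any further.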

library(h2o)
h2o.init()

# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])

xgb <- h2o.xgboost(x = x,
                   y = y, 
                   seed = 1,
                   training_frame = train,
                   keep_cross_validation_predictions = TRUE,
                   eta = 0.01,
                   max_depth = 3,
                   sample_rate = 0.8,
                   col_sample_rate = 0.6,
                   ntrees = 500,
                   reg_lambda = 0,
                   reg_alpha = 1000,
                   distribution = 'bernoulli')
