DRF checkpoint with cross-validation fails with error "ERRR: _weights_column: Weights column '__internal_cv_weights__' not found in the training frame"

Question

Tried on 28.0.2 and the latest 30.0.1.

I create the first DRF:

rf1 <- h2o.randomForest(
  model_id="first_drf1_x1",
  x = f2,
  y = r1,
  training_frame = train1,
  validation_frame = valid1,
  ntrees = 49,
  nfolds = 5,
  seed = 1
)

It trains without problems. Then I try to continue training from that model like this:

rf2 <- h2o.randomForest(
  model_id="second_drf1_x2",
  x = f2,
  y = r1,
  training_frame = train2,
  validation_frame = valid2,
  ntrees = (49+50),
  nfolds = 5,
  checkpoint = "first_drf1_x1",
  seed = 1
)

The log immediately shows:

POST /3/ModelBuilders/drf, parms: {model_id=second_drf1_x2, validation_frame=RTMP_sid_aea1_16, response_column=pcs7_e_dt_4010u, training_frame=RTMP_sid_aea1_14, seed=1, nfolds=5, ntrees=99, ignored_columns=["ts","leve_batch_nbr"], checkpoint=first_drf1_x1}
04-30 10:20:34.601 127.0.0.1:54321       55804  FJ-1-5    INFO: Creating 5 cross-validation splits with random number seed: 1
04-30 10:20:34.612 127.0.0.1:54321       55804  FJ-1-5    ERRR: _weights_column: Weights column '__internal_cv_weights__' not found in the training frame

When the first model was built, 5 CV models were created, and each of them has this internal field set:

"_weights_column":"__internal_cv_weights__",

But once the first main model has finished training, the field is null:

Building main model.
...
"_weights_column":null,

I found a bug report in the h2o JIRA, but maybe someone has already run into this and found a workaround. If nfolds is set to 0 (cross-validation disabled), everything works.

Answer 1

You need to disable nfolds. As the docs say, "Cross-validation is not currently supported for checkpointing."
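A minimal sketch of what that looks like, assuming a running local H2O cluster. The `mtcars` data, the 50/50 split, and the `mpg` response are hypothetical stand-ins for the `train1`/`train2` frames and columns in the question. Setting `nfolds = 0` in both calls is the conservative choice here; the question only confirms that the run works once cross-validation is disabled.

```r
library(h2o)
h2o.init()

# Hypothetical stand-in data: two halves of mtcars play the roles of
# train1 and train2 from the question.
cars   <- as.h2o(mtcars)
parts  <- h2o.splitFrame(cars, ratios = 0.5, seed = 1)
train1 <- parts[[1]]
train2 <- parts[[2]]

y <- "mpg"
x <- setdiff(colnames(mtcars), y)

# First model: cross-validation disabled so it can be used as a checkpoint.
rf1 <- h2o.randomForest(
  model_id = "first_drf1_x1",
  x = x,
  y = y,
  training_frame = train1,
  ntrees = 49,
  nfolds = 0,
  seed = 1
)

# Continue from the checkpoint: ntrees is the new total (49 old + 50 new),
# and nfolds stays at 0.
rf2 <- h2o.randomForest(
  model_id = "second_drf1_x2",
  x = x,
  y = y,
  training_frame = train2,
  ntrees = 49 + 50,
  nfolds = 0,
  checkpoint = rf1@model_id,
  seed = 1
)
```

Note that `ntrees` in the second call is the total number of trees, not the number of trees to add.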

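Also, if you are using new data, starting a DRF from an old model may not make much sense. The old/original trees (1-49) do not benefit from the additional observations in the new data; only the new trees built after the checkpoint (50-99) see them. So half of your trees will be missing some predictive information, which may introduce some bias into your scoring.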
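One alternative that follows from this reasoning (a sketch, not something the docs prescribe) is to skip the checkpoint and train a fresh forest on the old and new data combined, which also lets you keep `nfolds = 5`. As before, `mtcars`, the split, and the model id `combined_drf` are hypothetical stand-ins.

```r
library(h2o)
h2o.init()

# Hypothetical stand-in data: two halves of mtcars represent the old and
# the new observations.
cars      <- as.h2o(mtcars)
parts     <- h2o.splitFrame(cars, ratios = 0.5, seed = 1)
train_all <- h2o.rbind(parts[[1]], parts[[2]])

y <- "mpg"
x <- setdiff(colnames(mtcars), y)

# No checkpoint is involved, so cross-validation can stay enabled and
# every tree is built from all observations.
rf_all <- h2o.randomForest(
  model_id = "combined_drf",
  x = x,
  y = y,
  training_frame = train_all,
  ntrees = 99,
  nfolds = 5,
  seed = 1
)

# Cross-validated metrics for the combined model.
h2o.performance(rf_all, xval = TRUE)
```

The trade-off is training time: all 99 trees are rebuilt instead of only the 50 new ones.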


h2o