How do h2o models determine what columns to use for predictions (position, name, etc.)?

Question

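I am using the h2o Python API to train some models and am a bit confused about how to correctly use some parts of the API. Specifically: which columns of the training dataset should be ignored, and how does the model find the actual predictor features in a dataset when its predict() method is called? Also, how should the weights column be handled when the dataset used for actual predictions has no weights?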

The details of the code are (I think) not important here, but the basic training logic looks like

drf_dx = h2o.h2o.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())), 
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train, 
          training_frame=train_u, validation_frame=val_u, 
          max_runtime_secs=max_train_time_hrs*60*60)

(note the ignored columns) and the prediction logic looks like

preds = model.predict(X)

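where X is some (h2o) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns for post-processing exploration in a Jupyter notebook). E.g. the X_train columns may look like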

<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>

and the X columns may look like

<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>

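My question is: will this confuse the model when making predictions? I.e. does the model pick the columns to use as features by column name (in which case I don't think the differing dataframe widths would be a problem), by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or by something else? And since these extra columns were not listed in the ignored_columns argument of the model constructor, what happens to them?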

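** Slight aside: should the weights_column name be in the ignored_columns list? The example in the documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature, and the text also seems to recommend doing so: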

For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).

But these weight values are not something that comes with the data used in actual predictions.

Answer 1

I've broken your question down into a few separate parts, so the answer is in Q/A form.

1) How does H2O-3 know which columns to predict with when I use my_model.predict(X)?

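H2O-3 will use the columns that you passed as predictors when you built the model: either whatever you passed to the x argument when training the estimator, or all the columns in your training frame that were not ignored via ignored_columns, passed as the target to the y argument, or dropped because the column has a constant value. My recommendation is to use the x argument to specify your predictors and to leave the ignored_columns parameter alone. If X, the new dataframe you are predicting on, contains columns that were not used when the model was built, those columns will be ignored. So: column name, not column position.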
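To make the "by name, not by position" point concrete, here is a minimal sketch in plain Python (no H2O cluster needed) that compares a scoring frame's column names with an explicit predictor list. The function name `split_by_name` and the column names `age`, `bmi`, and `notes` are hypothetical and chosen only for illustration. This mirrors the matching described above; it is not H2O's internal implementation.

```python
def split_by_name(score_columns, predictors):
    """Classify a scoring frame's columns against the model's predictors by name.

    Returns (used, extra, missing):
      used    - predictors present in the scoring frame, in predictor order
      extra   - scoring-frame columns the model never saw as predictors
      missing - predictors absent from the scoring frame
    """
    score_set = set(score_columns)
    predictor_set = set(predictors)
    used = [c for c in predictors if c in score_set]
    extra = [c for c in score_columns if c not in predictor_set]
    missing = [c for c in predictors if c not in score_set]
    return used, extra, missing


# Hypothetical predictor list, as it would be passed to train(x=...).
predictors = ["age", "bmi"]

# A scoring frame that is wider than, and ordered differently from,
# the training frame.
score_columns = ["notes", "bmi", "patient_id", "age"]

used, extra, missing = split_by_name(score_columns, predictors)
print("used:", used)
print("extra (not used as features):", extra)
print("missing:", missing)
```

With an H2OFrame, the column names would come from the frame's `columns` attribute (for example `X.columns`), so a check like this can be run before calling `model.predict(X)` to confirm that every predictor is present regardless of frame width or ordering.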

2) Should the weights column name be in the ignored columns list?

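No. If you pass the weights column in the ignored columns list, that column will not be considered in any way during the model-building phase. In fact, if you test this, you should get a null pointer error or something similar.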

3) Why is the "weights" column specified both as a predictor and as the weights_column in the documentation's code example?

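This is a good question! I have created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is that if you pass the same column to both the predictors argument x and the weights_column argument, the column will be used only as the weights. It will not be used as a feature.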
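Putting points 2 and 3 together, one way to avoid the ambiguity is to keep the weights column out of both ignored_columns and x, and name it only in weights_column. The sketch below builds those settings in plain Python. The helper `drf_settings` and the column names `age` and `bmi` are hypothetical. The H2O calls appear only in comments because they need a running cluster, and they reuse the frame names from the question.

```python
def drf_settings(frame_columns, response, weights, id_columns):
    """Build keyword arguments for the estimator and for train().

    The weights column is named only in weights_column: it is excluded
    from the predictor list x and never placed in ignored_columns.
    """
    if weights not in frame_columns:
        raise ValueError(f"weights column {weights!r} is not in the frame")
    excluded = set(id_columns) | {response, weights}
    x = [c for c in frame_columns if c not in excluded]
    estimator_kwargs = {"weights_column": weights, "ntrees": 64, "nbins": 32}
    train_kwargs = {"x": x, "y": response}
    return estimator_kwargs, train_kwargs


# Hypothetical training-frame columns.
columns = ["patient_id", "account_id", "age", "bmi", "weight", "dx_outcome"]
estimator_kwargs, train_kwargs = drf_settings(
    columns,
    response="dx_outcome",
    weights="weight",
    id_columns=["patient_id", "account_id"],
)
print(estimator_kwargs)
print(train_kwargs)

# With a running H2O cluster, these would be used roughly as:
#   from h2o.estimators import H2ORandomForestEstimator
#   model = H2ORandomForestEstimator(**estimator_kwargs)
#   model.train(training_frame=train_u, validation_frame=val_u, **train_kwargs)
```

Because x is given explicitly, identifier columns such as patient_id are left out of the model without needing ignored_columns at all.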

4) Does the user guide recommend using the weights column as both a feature and the weights?

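No. The passage you point to recommends making sure that the column you pass as weights_column is present in both your training frame and your validation frame. It does not say that the column should also be included as a feature.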
