How to correctly use a model explainer with unseen data?

Question

I trained my classifier using a pipeline:

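import category_encoders as ce
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Note: ColumnSelector is not defined in the post; it is assumed to be a
# column-selecting transformer defined or imported elsewhere.

# Hyperparameter grid; the 'classifier__' prefix targets the pipeline step.
param_tuning = {
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7, 10],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.5, 0.7],
    'classifier__n_estimators': [100, 200, 500],
}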

cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler())
    ]
)

preprocessor = FeatureUnion(
    transformer_list=[
        ('cat', cat_pipe),
        ('num', num_pipe)
    ]
)

xgb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier())
    ]
)

grid = GridSearchCV(xgb_pipe, param_tuning, cv=5, n_jobs=-1, scoring='accuracy')

xgb_model = grid.fit(X_train, y_train)

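The training data contain categorical features, so the transformed data have shape (x, 100). Afterwards, I tried to explain the model's prediction on unseen data. Because I passed a single unseen example directly to the model, it was preprocessed into shape (x, 15), since a single observation does not contain every category present across all examples.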

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)

I got:

ValueError: Shape of passed values is (1, 15), indices imply (1, 100).

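This happens because the model was trained on the whole preprocessed dataset, which has shape (x, 100), but I am passing the explainer a single observation with shape (1, 15). How do I correctly pass a single unseen observation to the explainer?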

Answer 1

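We never use .fit_transform() on unseen data; the correct way is to use the .transform() method of the preprocessor that has already been fitted to your training data (here xgb['preprocessor']). This way, we ensure that the (transformed) unseen data have the same features as our (transformed) training data, and hence that they are compatible with the model built with the latter.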
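To illustrate the point above, here is a minimal sketch using scikit-learn's OneHotEncoder as a stand-in for the preprocessor in the question. The toy arrays and the variable names (X_train, X_new, good, bad) are assumptions made up for this example and do not come from the original post.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data (made up): one categorical column with three distinct values.
X_train = np.array([["red"], ["green"], ["blue"], ["green"]])
# A single unseen observation.
X_new = np.array([["green"]])

# handle_unknown="ignore" encodes categories not seen in training as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# Correct: reuse the fitted encoder, so the column count matches training.
good = encoder.transform(X_new).toarray()

# Incorrect: refitting on the single row learns only the one category present.
bad = OneHotEncoder(handle_unknown="ignore").fit_transform(X_new).toarray()

print(good.shape)  # (1, 3)
print(bad.shape)   # (1, 1)
```

The refitted encoder only knows about the one category in the single row, which is the same effect that shrinks (1, 100) to (1, 15) in the question.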

So, you should replace .fit_transform(df) here:

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)

with .transform(df):

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].transform(df), columns=xgb['classifier'].get_booster().feature_names)
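For context, a rough end-to-end sketch of the same idea follows. It assumes the fitted pipeline is taken from grid.best_estimator_ (the post does not show how xgb was obtained), and it swaps in ColumnTransformer and LogisticRegression for the FeatureUnion and XGBClassifier of the question so that it runs with scikit-learn and pandas alone. The toy data and column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are made up for illustration.
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
y_train = [0, 1, 0, 1, 0, 1, 0, 1]

# Stand-in preprocessor: one-hot encode the categorical column, scale the numeric one.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["size"]),
])

# Stand-in classifier (LogisticRegression instead of XGBClassifier).
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

grid = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0]}, cv=2)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is the pipeline refit on all of X_train.
fitted = grid.best_estimator_

# A single unseen observation: transform only, never fit again.
new_row = pd.DataFrame({"color": ["green"], "size": [2.5]})
new_features = fitted["preprocessor"].transform(new_row)
prediction = fitted["classifier"].predict(new_features)

print(new_features.shape)  # (1, 4): three one-hot columns plus one scaled column
```

Here new_features has the same width as the transformed training data, so it is the kind of array that could be handed to an explainer together with the fitted classifier step.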

python

machine-learning

scikit-learn

eli5