针对多类 Brier 分数而不是准确性优化分类器

Question

我更感兴趣的是使用 Brier 分数而不是准确性来优化我的多类问题。为此，我正在使用 predict_proba() 的结果评估我的分类器，例如：

import numpy as np

probs = np.array(
[    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1]]
)

targets = np.array(
    [[0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.9, 0],
    [0, 0, 1],
    [0.5, 0.3, 0.2],
    [0.1, 0.5, 0.4],
    [0.34, 0.33, 0.33]]
)


def brier_multi(targets, probs):
    return np.mean(np.sum((probs - targets) ** 2, axis=1))


brier_multi(targets, probs)

是否可以在多类 Brier 分数训练期间直接优化 scikit-learns 分类器而不是准确性？

编辑：

...
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("selector", None),
        ("classifier", model.get("classifier")),
    ]
)

def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))

brier_multi_loss = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

search = GridSearchCV(
    estimator=pipe,
    param_grid=model.get("param_grid"),
    scoring=brier_multi_loss,
    cv=3,
    n_jobs=-1,
    refit=True,
    verbose=3,
)
search.fit(X_train, y_train)
...

得分为 nan

/home/andreas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan]
  warnings.warn(

Answer 1

您已经知道 scoring 参数，所以您只需要将 brier_multi 包装成 GridSearchCV 期望的格式。有一个实用程序，make_scorer:

from sklearn.metrics import make_scorer

neg_mc_brier_score = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

GridSearchCV(..., scoring=neg_mc_brier_score)

参见User Guide and the docs for make_scorer。

不幸的是，这不会运行，因为你的记分器版本需要一个 one-hot-encoded 目标数组，而 sklearn multiclass 将 y_true 作为一维数组发送。作为确保其余工作正常的 hack，您可以修改：

def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))

但我鼓励您使它更强大（如果类不只是 0, 1, ..., n_classes-1 怎么办？）。

对于它的价值，sklearn 正在进行 PR 以添加多类 Brier 分数：https://github.com/scikit-learn/scikit-learn/pull/22046（请务必查看链接的 PR18699，因为它已开始开发和审查）。

针对多类 Brier 分数而不是准确性优化分类器

Optimize classifier for multiclass Brier score instead of accuracy

python-3.x

scikit-learn

multiclass-classification