Why do I get different expected_value when I include the training data in TreeExplainer?

Including the training data in SHAP TreeExplainer gives a different expected_value for a scikit-learn GBT regressor.

Reproducible example (run in Google Colab):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap

shap.__version__
# 0.37.0

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])

# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635

So the expected value obtained when including the training data, gbt_data_explainer.expected_value, is quite different from the one computed without supplying the data (gbt_explainer.expected_value).

Both approaches are additive and consistent when used together with their (quite different) respective shap_values:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

But I would like to understand why they do not give the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean of the predictions.

What am I missing here?

Apparently, when data is passed, shap subsamples it down to 100 rows and then runs those rows through the trees to recompute the sample counts at each node. So the reported -23.5... is the average model output over those 100 rows.

The data gets passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
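A quick way to verify this is to build the Independent masker by hand (a minimal sketch: max_samples=100 mirrors the 100-row subsampling described above, and gbt / X_train are the objects from the question):

from shap import maskers

# reproduce the subsampling TreeExplainer triggers internally when data=X_train is passed
masker = maskers.Independent(X_train, max_samples=100)
masker.data.shape
# (100, 10)

# the "different" expected_value is just the mean model output over those 100 rows
gbt.predict(masker.data).mean()
# -23.56479732207963, i.e. gbt_data_explainer.expected_value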

Running

from shap import maskers

another_gbt_explainer = shap.TreeExplainer(
    gbt,
    data=maskers.Independent(X_train, max_samples=800),
    feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value

gives back

-11.534353657511172

(Since X_train has exactly 800 rows here, max_samples=800 means no subsampling takes place, and the expected value is again the mean prediction over the full training set.)

Although @Ben did a great job digging out how the data gets passed through the Independent masker, his answer does not show exactly (1) how the base value is calculated and where the different base values come from, and (2) how to choose/lower the max_samples parameter.

Where the different value comes from

The masker object has a data attribute that holds the data after the masking process. To get the value shown in gbt_explainer.expected_value:

from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)  # fit on the regression training data from the question

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963

what one needs to do is:

masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963

What about choosing max_samples?

Setting max_samples to the length of the original dataset seems to work with other explainers too:

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit

corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)

vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)

model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)

explainer = shap.Explainer(model
                           ,masker = Independent(X_train,100)
                           ,feature_names=vectorizer.get_feature_names()
                          )
explainer.expected_value
# -0.18417413671991964

This value comes from:

masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])
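The logit here just inverts the sigmoid: for a fitted LogisticRegression, logit(predict_proba(x)[:, 1]) is exactly the linear decision function (log-odds), which is the space the explainer's base value is reported in. A quick sketch of that equivalence, reusing the masker from just above:

import numpy as np

x_mean = masker.data.mean(0).reshape(1, -1)
# logit of the predicted probability equals the raw linear score (coef @ x + intercept)
np.allclose(logit(model.predict_proba(x_mean)[:, 1]), model.decision_function(x_mean))
# True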

max_samples=100 seems to be a bit off from the true base_value (obtained by just feeding in the feature means):

logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
# array([-0.02938039])

Increasing max_samples gets us reasonably close to the true baseline while still keeping the number of samples low:

masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238

So, to get the base value for the explainer of interest: (1) pass explainer.data (or masker.data) through the model, and (2) choose max_samples such that the base value on the sampled data is close enough to the true base value. One may also watch whether the values and the ordering of the SHAP importances converge.
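For example, a small sketch of such a check for the text model above, watching the masked base value approach the true one as max_samples grows (the intermediate numbers depend on which rows happen to be sampled):

# true base value: model output at the mean of the full training data, in log-odds space
true_base = logit(model.predict_proba(X_train.mean(0).reshape(1, -1))[:, 1])

for n in (100, 1000, 5000, X_train.shape[0]):
    masker = Independent(X_train, max_samples=n)
    base = logit(model.predict_proba(masker.data.mean(0).reshape(1, -1))[:, 1])
    print(n, base, abs(base - true_base))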

Some may notice that, in order to get the base value, we sometimes average over the feature inputs (LogisticRegression) and sometimes over the outputs (GBT).