SHAP TreeExplainer for RandomForest multiclass: what is shap_values[i]?
I am trying to plot SHAP values. Here is my code, where rnd_clf is a RandomForestClassifier:
import shap
explainer = shap.TreeExplainer(rnd_clf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values[1], X)
I know that for binary classification shap_values[0] is the negative class and shap_values[1] is the positive class.
But what about a multiclass RandomForestClassifier? My rnd_clf classifies one of:
['Gusto', 'Kestrel 200 SCI Older Road Bike', 'Vilano Aluminum Road Bike 21 Speed Shimano', 'Fixie'].
How do I determine which index of shap_values[i] corresponds to which class of my output?
shap_values[i] are the SHAP values for the i-th class. Which class is the i-th one is rather a question of the encoding scheme you used: LabelEncoder, pd.factorize, etc.
You may try the following as a clue:
from pprint import pprint
from sklearn.preprocessing import LabelEncoder
labels = [
"Gusto",
"Kestrel 200 SCI Older Road Bike",
"Vilano Aluminum Road Bike 21 Speed Shimano",
"Fixie",
]
le = LabelEncoder()
y = le.fit_transform(labels)
encoding_scheme = dict(zip(y, labels))
pprint(encoding_scheme)
{0: 'Fixie',
1: 'Gusto',
2: 'Kestrel 200 SCI Older Road Bike',
3: 'Vilano Aluminum Road Bike 21 Speed Shimano'}
So, for example, shap_values[3] in this particular case is for 'Vilano Aluminum Road Bike 21 Speed Shimano'.
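Note that LabelEncoder above only reproduces sklearn's alphabetical ordering; the authoritative ordering is stored on the fitted classifier itself. A minimal sketch (assuming rnd_clf has already been fit on your data): shap_values[i] lines up with column i of predict_proba, which in turn lines up with rnd_clf.classes_[i]:
# Assumes rnd_clf is your fitted RandomForestClassifier.
# sklearn stores the label order in classes_, and shap_values[i]
# follows the same order as the columns of predict_proba.
for i, cls in enumerate(rnd_clf.classes_):
    print(i, cls)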
To further understand how to interpret SHAP values, let's prepare a synthetic dataset for multiclass classification, with 100 features and 10 classes:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap import summary_plot
X, y = make_classification(1000, 100, n_informative=8, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
(750, 100)
At this point we have a training dataset with 750 rows, 100 features, and 10 classes.
Let's train a RandomForestClassifier and feed it to TreeExplainer:
clf = RandomForestClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
explainer = TreeExplainer(clf)
shap_values = np.array(explainer.shap_values(X_train))
print(shap_values.shape)
(10, 750, 100)
10: number of classes. All SHAP values are organized into 10 arrays, 1 array per class.
750: number of datapoints. We have local SHAP values for every datapoint.
100: number of features. We have a SHAP value for every feature.
For example, for class 3 you will have:
print(shap_values[3].shape)
(750, 100)
750: SHAP values for every datapoint
100: SHAP value contributions for every feature
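As a quick illustration of how to slice into this structure (a sketch; row index 0 and the top-5 cutoff are arbitrary choices), you can pull out the SHAP values of a single datapoint for class 3 and rank its features by absolute contribution:
# SHAP values of the first training row for class 3: shape (100,)
single_row = shap_values[3][0]
# Indices of the 5 features with the largest absolute contribution for this row
top_idx = np.argsort(-np.abs(single_row))[:5]
print(top_idx)
print(single_row[top_idx])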
Finally, you may run a sanity check to ensure the real predictions from the model are the same as those predicted by shap.
To do so, we will (1) swap the first two dimensions of shap_values, (2) sum the SHAP values per class over all features, and (3) add the SHAP base values:
shap_values_ = shap_values.transpose((1, 0, 2))  # shape (750, 10, 100)
np.allclose(
    clf.predict_proba(X_train),
    shap_values_.sum(2) + explainer.expected_value
)
True
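The same check can be done for a single datapoint (a sketch; row 0 and class 3 are arbitrary picks): the predicted probability for that class should equal the class base value plus the sum of that row's SHAP values:
row, cls = 0, 3
# expected_value[cls] is the base value for class `cls`; adding the row's
# SHAP values reconstructs predict_proba for that class.
reconstructed = explainer.expected_value[cls] + shap_values[cls][row].sum()
print(np.isclose(clf.predict_proba(X_train[[row]])[0, cls], reconstructed))  # True, per the allclose check above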
Then you can proceed to summary_plot, which will show feature rankings based on SHAP values on a per-class basis. For class 3 this will be:
summary_plot(shap_values[3], X_train)
which reads as follows:
For class 3, the most influential features based on SHAP contributions are 16, 59, and 24.
For feature 15, lower values tend to result in higher SHAP values (and hence a higher probability of the class label).
Features 50, 45, and 48 are the least influential of the 20 displayed.
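If you then want to zoom in on one of the top-ranked features for class 3, a dependence plot is one option (a sketch; feature index 16 just follows the ranking read off the summary plot above, and the exact indices will differ for a different random dataset):
# Scatter of feature 16's value vs. its SHAP value for class 3,
# colored by the feature shap selects as the strongest interaction by default.
from shap import dependence_plot
dependence_plot(16, shap_values[3], X_train)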