如何以概率输出Shap值并使二元分类器force_plot
How to output Shap values in probability and make force_plot from binary classifier
我需要绘制每个特征如何影响来自 LightGBM
二元分类器的每个样本的预测概率。所以我需要输出概率的 Shap 值,而不是正常的 Shap 值。它似乎没有任何概率输出选项。
下面的示例代码是我用来生成 Shap 值的数据帧并对第一个数据样本执行 force_plot
的代码。有谁知道我应该如何修改代码来改变输出?
我是 Shap 值和 Shap 包的新手。提前致谢。
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]
shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
您可以通过 softmax() 函数考虑 运行 您的输出值。作为参考,它定义为:
def get_softmax_probabilities(x):
return np.exp(x) / np.sum(np.exp(x)).reshape(-1, 1)
还有一个 scipy 实现:
from scipy.special import softmax
softmax() 的输出将是与向量 x 中的(相对)值成比例的概率,即您的商店价值。
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)
# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
# verification
def verification(index_number,class_idx):
print('-----------------------------------')
print('index_number: ', index_number)
print('class_idx: ', class_idx)
print('')
y_base = explainer.expected_value[class_idx]
print('y_base: ', y_base)
player_explainer = pd.DataFrame()
player_explainer['feature_value'] = X_train.iloc[j].values
player_explainer['shap_value'] = shap_values[class_idx][j]
print('verification: ')
print('y_base + sum_of_shap_values: %.2f'%(y_base + player_explainer['shap_value'].sum()))
print('y_pred: %.2f'%(y_train[j]))
j = 10 # index
verification(j,0)
verification(j,1)
# show:
# X_train: (455, 30)
# X_test: (114, 30)
# -----------------------------------
# index_number: 10
# class_idx: 0
# y_base: -2.391423081639827
# verification:
# y_base + sum_of_shap_values: -9.40
# y_pred: 1.00
# -----------------------------------
# index_number: 10
# class_idx: 1
# y_base: 2.391423081639827
# verification:
# y_base + sum_of_shap_values: 9.40
# y_pred: 1.00
# -9.40,9.40 takes the maximum value(class_idx:1 = y_pred), and the result is obviously correct.
我帮你实现了,验证了结果的可靠性
长话短说:
在force_plot
方法中用link="logit"
可以得到概率space的绘图结果:
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :],
link="logit",
)
预期输出:
或者,您可以通过以下方式实现相同的效果,明确指定 model_output="probability"
您感兴趣的解释:
explainer = shap.TreeExplainer(
model,
data=X_train,
feature_perturbation="interventional",
model_output="probability",
)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
shap_value = shap_values.values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :]
)
预期输出:
但是,了解这些数字的来源可能更有趣:
- 我们的兴趣点目标概率:
model_proba= model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])
- 来自给定
X_train
作为背景的模型的原始基本案例(注意,LightGBM
为 class 1
输出原始数据):
model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577
- 来自
SHAP
的原始基本案例(注意,它们是对称的):
bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519, 2.48397519])
- 兴趣点的原始
SHAP
值:
sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584, 3.40619584])
- Proba 从
SHAP
值推断(通过 sigmoid):
shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])
- 检查:
assert np.allclose(model_proba, shap_proba)
有什么不明白的地方请提问
旁注
Proba might be misleading if you're analyzing raw size effect of different features because sigmoid is non-linear and saturates after reaching certain threshold.
Some people expect to see SHAP values in probability space as well, but this is not feasible because:
- SHAP values are additive by construction (to be precise SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions)
- exp(a + b) != exp(a) + exp(b)
您可能会觉得有用:
二进制文件中的特征重要性 class化和提取 SHAP 值仅用于 classes 之一
GBTclassifier的base_value使用SHAP时如何解读?
我需要绘制每个特征如何影响来自 LightGBM
二元分类器的每个样本的预测概率。所以我需要输出概率的 Shap 值,而不是正常的 Shap 值。它似乎没有任何概率输出选项。
下面的示例代码是我用来生成 Shap 值的数据帧并对第一个数据样本执行 force_plot
的代码。有谁知道我应该如何修改代码来改变输出?
我是 Shap 值和 Shap 包的新手。提前致谢。
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]
shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
您可以通过 softmax() 函数考虑 运行 您的输出值。作为参考,它定义为:
def get_softmax_probabilities(x):
return np.exp(x) / np.sum(np.exp(x)).reshape(-1, 1)
还有一个 scipy 实现:
from scipy.special import softmax
softmax() 的输出将是与向量 x 中的(相对)值成比例的概率,即您的商店价值。
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)
# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
# verification
def verification(index_number,class_idx):
print('-----------------------------------')
print('index_number: ', index_number)
print('class_idx: ', class_idx)
print('')
y_base = explainer.expected_value[class_idx]
print('y_base: ', y_base)
player_explainer = pd.DataFrame()
player_explainer['feature_value'] = X_train.iloc[j].values
player_explainer['shap_value'] = shap_values[class_idx][j]
print('verification: ')
print('y_base + sum_of_shap_values: %.2f'%(y_base + player_explainer['shap_value'].sum()))
print('y_pred: %.2f'%(y_train[j]))
j = 10 # index
verification(j,0)
verification(j,1)
# show:
# X_train: (455, 30)
# X_test: (114, 30)
# -----------------------------------
# index_number: 10
# class_idx: 0
# y_base: -2.391423081639827
# verification:
# y_base + sum_of_shap_values: -9.40
# y_pred: 1.00
# -----------------------------------
# index_number: 10
# class_idx: 1
# y_base: 2.391423081639827
# verification:
# y_base + sum_of_shap_values: 9.40
# y_pred: 1.00
# -9.40,9.40 takes the maximum value(class_idx:1 = y_pred), and the result is obviously correct.
我帮你实现了,验证了结果的可靠性
长话短说:
在force_plot
方法中用link="logit"
可以得到概率space的绘图结果:
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :],
link="logit",
)
预期输出:
或者,您可以通过以下方式实现相同的效果,明确指定 model_output="probability"
您感兴趣的解释:
explainer = shap.TreeExplainer(
model,
data=X_train,
feature_perturbation="interventional",
model_output="probability",
)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
shap_value = shap_values.values[row_idx]
shap.force_plot(
base_value=expected_value,
shap_values=shap_value,
features=X_train.iloc[row_idx, :]
)
预期输出:
但是,了解这些数字的来源可能更有趣:
- 我们的兴趣点目标概率:
model_proba= model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])
- 来自给定
X_train
作为背景的模型的原始基本案例(注意,LightGBM
为 class1
输出原始数据):
model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577
- 来自
SHAP
的原始基本案例(注意,它们是对称的):
bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519, 2.48397519])
- 兴趣点的原始
SHAP
值:
sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584, 3.40619584])
- Proba 从
SHAP
值推断(通过 sigmoid):
shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])
- 检查:
assert np.allclose(model_proba, shap_proba)
有什么不明白的地方请提问
旁注
Proba might be misleading if you're analyzing raw size effect of different features because sigmoid is non-linear and saturates after reaching certain threshold.
Some people expect to see SHAP values in probability space as well, but this is not feasible because:
- SHAP values are additive by construction (to be precise SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions)
- exp(a + b) != exp(a) + exp(b)
您可能会觉得有用:
二进制文件中的特征重要性 class化和提取 SHAP 值仅用于 classes 之一
GBTclassifier的base_value使用SHAP时如何解读?