Get waterfall plot values of a feature in a dataframe using shap package
I am doing binary classification using a random forest model and a neural network, and I am using SHAP to explain the model predictions. I followed a tutorial and wrote the code below to get the waterfall plot shown below.
With the help of Sergey Bushmanov's SO post here, I managed to export the waterfall plot to a dataframe. But this doesn't copy the feature values of the columns. It only copies the shap values, expected_value, and feature names. I want the feature values as well. So, I tried the following:
shap.waterfall_plot(shap.Explanation(values=shap_values[1])[4],
                    base_values=explainer.expected_value[1],
                    data=ord_test_t.iloc[4],
                    feature_names=ord_test_t.columns.tolist())
But this raised the error:
TypeError: waterfall() got an unexpected keyword argument 'base_values'
I would like my output to look like the one below. I used a background of 1 point to compute the base value, but you are also free to use a background of 1, 10, or 100 points. In the output below, I store the value and the feature together in a single column called Feature. This is similar to LIME. But I am not sure whether SHAP has the flexibility to do this?
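One way to get LIME-style labels is to fold each raw feature value into its feature name before plotting. A minimal pandas sketch; the feature names and numbers below are made up for illustration:

```python
import pandas as pd

# Toy per-row SHAP export; the names and numbers here are made up.
df = pd.DataFrame({
    "feature": ["age", "income", "tenure"],
    "feature_value": [42.0, 55000.0, 3.2],
    "shap_values": [0.12, -0.30, 0.05],
})

# LIME-style labels: fold the raw value into the feature name so a single
# "Feature" column carries both, e.g. "income = 55000.0".
df["Feature"] = df["feature"] + " = " + df["feature_value"].astype(str)
lime_style = df[["Feature", "shap_values"]]
print(lime_style)
```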
Update - plot
Updated code - KernelExplainer waterfall to dataframe
masker = Independent(X_train, max_samples=100)
explainer = KernelExplainer(rf_boruta.predict, X_train)
bv = explainer.expected_value
sv = explainer.shap_values(X_train)
sdf_train = pd.DataFrame({
    'row_id': X_train.index.values.repeat(X_train.shape[1]),
    'feature': X_train.columns.to_list() * X_train.shape[0],
    'feature_value': X_train.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()  # I changed this to pd.DataFrame(sv).values[:,1].flatten()
})
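As a sanity check on any such export, the shap values per row plus the base value should reconstruct the model output (SHAP's additivity property), and flattening into long format must not break the per-row sums. A dependency-light numpy sketch with toy arrays standing in for the explainer output:

```python
import numpy as np

# Toy stand-ins for the explainer output above: 3 rows, 4 features.
rng = np.random.default_rng(0)
sv = rng.normal(size=(3, 4))      # per-feature shap values
bv = 0.5                          # scalar expected_value (base value)
preds = bv + sv.sum(axis=1)       # additivity: f(x) = E[f(X)] + sum of shap values

# Flattening for the long-format frame and reshaping back must preserve
# the per-row sums, so each row_id still reconstructs its prediction.
recovered = sv.flatten().reshape(3, 4).sum(axis=1) + bv
assert np.allclose(recovered, preds)
```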
Try the following:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

import shap
print(shap.__version__)
# 0.39.0

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X)
exp = Explanation(sv.values[:,:,1],
                  sv.base_values[:,1],
                  data=X.values,
                  feature_names=X.columns)
idx = 0
waterfall(exp[idx])
Then:
pd.DataFrame({
    'row_id': idx,
    'feature': X.columns,
    'feature_value': exp[idx].data,  # .data holds the raw feature values; .values holds the shap values
    'base_value': exp[idx].base_values,
    'shap_values': exp[idx].values
})
#expected output
row_id feature feature_value base_value shap_values
0 0 mean radius 17.990000 0.628998 -0.035453
1 0 mean texture 10.380000 0.628998 0.047571
2 0 mean perimeter 122.800000 0.628998 -0.036218
3 0 mean area 1001.000000 0.628998 -0.041276
4 0 mean smoothness 0.118400 0.628998 -0.006842
5 0 mean compactness 0.277600 0.628998 -0.009275
6 0 mean concavity 0.300100 0.628998 -0.035188
7 0 mean concave points 0.147100 0.628998 -0.051165
8 0 mean symmetry 0.241900 0.628998 -0.002192
9 0 mean fractal dimension 0.078710 0.628998 0.001521
10 0 radius error 1.095000 0.628998 -0.021223
11 0 texture error 0.905300 0.628998 -0.000470
12 0 perimeter error 8.589000 0.628998 -0.021423
13 0 area error 153.400000 0.628998 -0.035313
14 0 smoothness error 0.006399 0.628998 -0.000060
15 0 compactness error 0.049040 0.628998 0.001053
16 0 concavity error 0.053730 0.628998 -0.002988
17 0 concave points error 0.015870 0.628998 0.000140
18 0 symmetry error 0.030030 0.628998 0.001238
19 0 fractal dimension error 0.006193 0.628998 -0.001097
20 0 worst radius 25.380000 0.628998 -0.050027
21 0 worst texture 17.330000 0.628998 0.038056
22 0 worst perimeter 184.600000 0.628998 -0.079717
23 0 worst area 2019.000000 0.628998 -0.072312
24 0 worst smoothness 0.162200 0.628998 -0.006917
25 0 worst compactness 0.665600 0.628998 -0.016184
26 0 worst concavity 0.711900 0.628998 -0.022500
27 0 worst concave points 0.265400 0.628998 -0.088697
28 0 worst symmetry 0.460100 0.628998 -0.026166
29 0 worst fractal dimension 0.118900 0.628998 -0.007683
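The table above can be checked for additivity by hand: the base value plus the sum of the shap_values column should equal the model's class-1 probability for row 0 (roughly 0.048, i.e. the first sample is pushed strongly toward class 0). A quick check using the shap values exactly as printed:

```python
# shap values exactly as printed in the table above, top to bottom
shap_col = [
    -0.035453, 0.047571, -0.036218, -0.041276, -0.006842, -0.009275,
    -0.035188, -0.051165, -0.002192, 0.001521, -0.021223, -0.000470,
    -0.021423, -0.035313, -0.000060, 0.001053, -0.002988, 0.000140,
    0.001238, -0.001097, -0.050027, 0.038056, -0.079717, -0.072312,
    -0.006917, -0.016184, -0.022500, -0.088697, -0.026166, -0.007683,
]
base_value = 0.628998

# Additivity: the waterfall starts at the base value and ends at the
# model output for this row.
pred = base_value + sum(shap_col)
print(round(pred, 6))
```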
RandomForest is a bit special, which is the reason for this. When something goes wrong with the new plots API, try feeding it an Explanation object.
Update
Explaining one class-1 data point against a single class-0 background data point:
back_id = 10
exp_id = 20
explainer = TreeExplainer(model, data=X.loc[[back_id]])
sv = explainer(X.loc[[exp_id]])
exp = Explanation(sv.values[:,:,1],
                  sv.base_values[:,1],
                  data=X.loc[[exp_id]].values,  # feature values of the explained row
                  feature_names=X.columns)
waterfall(exp[0])
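With a single background row, the intuition is easy to verify in closed form: for a linear model, interventional SHAP values against one background point b are phi_i = w_i * (x_i - b_i), and the base value is simply f(b). A dependency-free numpy sketch (the weights and points are made up):

```python
import numpy as np

# Linear model f(x) = w @ x; weights and points below are made up.
w = np.array([0.5, -1.0, 2.0])
def f(x):
    return w @ x

b = np.array([1.0, 2.0, 0.0])   # the single background row (back_id above)
x = np.array([3.0, 1.0, 1.0])   # the row being explained (exp_id above)

# Closed-form SHAP values for a linear model against one background point.
base_value = f(b)
phi = w * (x - b)

# With one background point, the waterfall starts at f(b) and ends at f(x).
assert np.isclose(base_value + phi.sum(), f(x))
print(base_value, phi, base_value + phi.sum())
```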
Finally, as you asked, everything in the proposed format:
from shap.maskers import Independent
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)

masker = Independent(X_train, max_samples=100)
explainer = TreeExplainer(model, data=masker)
bv = explainer.expected_value[1]
sv = explainer(X_test, check_additivity=False)

pd.DataFrame({
    'row_id': X_test.index.values.repeat(X_test.shape[1]),
    'feature': X_test.columns.to_list() * X_test.shape[0],
    'feature_value': X_test.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()
})
But I'd definitely not show this to my mom.
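If the long format gets unwieldy, the frame pivots back to one row per observation with plain pandas. A toy sketch with two rows and two made-up features:

```python
import pandas as pd

# Toy long-format frame shaped like the export above: 2 rows x 2 features.
long_df = pd.DataFrame({
    'row_id': [0, 0, 1, 1],
    'feature': ['f1', 'f2', 'f1', 'f2'],
    'feature_value': [1.0, 2.0, 3.0, 4.0],
    'base_value': 0.5,
    'shap_values': [0.1, -0.2, 0.3, -0.4],
})

# One row per observation, one column per feature's shap value.
wide = long_df.pivot(index='row_id', columns='feature', values='shap_values')
print(wide)
```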