SHAP value dimensions are different for RandomForest and XGB: why/how? Is there something one can do about this?
The SHAP values returned from TreeExplainer's .shap_values(some_data) have different dimensions/results for XGB and Random Forest. I've tried looking into it, but can't seem to find why or how, nor an explanation in any of Slundberg's (the SHAP author's) tutorials. So:
- Is there a reason for this that I'm missing?
- Is there some flag that makes XGB return SHAP values per class, like the other models do, that isn't obvious or that I'm missing?

Some example code below!
import xgboost.sklearn as xgb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import shap

# Small slice of the breast cancer data, just to keep the repro quick
bc = load_breast_cancer()
cancer_df = pd.DataFrame(bc['data'], columns=bc['feature_names'])
cancer_df['target'] = bc['target']
cancer_df = cancer_df.iloc[0:50, :]
target = cancer_df['target']
cancer_df.drop(['target'], inplace=True, axis=1)
X_train, X_test, y_train, y_test = train_test_split(cancer_df, target, test_size=0.33, random_state=42)

# Fit one XGB and one RF classifier on the same split
xg = xgb.XGBClassifier()
xg.fit(X_train, y_train)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
xg_pred = xg.predict(X_test)
rf_pred = rf.predict(X_test)

# Identical TreeExplainer calls for both models
rf_explainer = shap.TreeExplainer(rf, X_train)
xg_explainer = shap.TreeExplainer(xg, X_train)
rf_vals = rf_explainer.shap_values(X_train)
xg_vals = xg_explainer.shap_values(X_train)
print('Random Forest')
print(type(rf_vals))
print(type(rf_vals[0]))
print(rf_vals[0].shape)
print(rf_vals[1].shape)
print('XGBoost')
print(type(xg_vals))
print(xg_vals.shape)
Output:
Random Forest
<class 'list'>
<class 'numpy.ndarray'>
(33, 30)
(33, 30)
XGBoost
<class 'numpy.ndarray'>
(33, 30)
Any ideas are helpful! Thanks!
For binary classification:

- SHAP values of XGBClassifier (sklearn API) are raw (log-odds) values for the 1 class, returned as a single array;
- SHAP values of RandomForestClassifier are probabilities for the 0 and 1 classes, returned as one array per class.

Demo:
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from scipy.special import expit

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# XGB: SHAP values come back as a single (n_samples, n_features) array
xgb = XGBClassifier(
    max_depth=5, n_estimators=100, eval_metric="logloss", use_label_encoder=False
).fit(X_train, y_train)
xgb_exp = TreeExplainer(xgb)
xgb_sv = np.array(xgb_exp.shap_values(X_test))
xgb_ev = np.array(xgb_exp.expected_value)
print("Shape of XGB SHAP values:", xgb_sv.shape)

# RF: SHAP values come back as one (n_samples, n_features) array per class
rf = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
rf_exp = TreeExplainer(rf)
rf_sv = np.array(rf_exp.shap_values(X_test))
rf_ev = np.array(rf_exp.expected_value)
print("Shape of RF SHAP values:", rf_sv.shape)
Shape of XGB SHAP values: (143, 30)
Shape of RF SHAP values: (2, 143, 30)
Interpretation:
- XGBoost (143, 30) dimensions:
  - 143: number of samples in the test set
  - 30: number of features
- RF (2, 143, 30) dimensions:
  - 2: number of output classes
  - 143: number of samples
  - 30: number of features

If you just want both models' SHAP values in the same shape, you can slice the RF array down to a single class, as sketched below.
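A minimal sketch, assuming the binary setup above: keep only the class-1 slice of the RF array so it lines up with XGB's array. Note the two then match in shape only, not in units, since the XGB values live in raw log-odds space while the RF values live in probability space.

# Class-1 SHAP values from the forest: shape (143, 30), same as xgb_sv
rf_sv_pos = rf_sv[1]
assert rf_sv_pos.shape == xgb_sv.shape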
To relate xgboost SHAP values to predicted probabilities, and hence classes, you can add the SHAP values to the base (expected) value and push the sum through the sigmoid (expit). For the 0th data point in the test set:
xgb_pred = expit(xgb_sv[0,:].sum() + xgb_ev)
assert np.isclose(xgb_pred, xgb.predict_proba(X_test)[0,1])
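The same reconstruction can be done for the whole test set at once; a sketch using the arrays defined above:

# Vectorized: sigmoid(row-wise SHAP sums + base value) recovers P(class 1)
xgb_probs = expit(xgb_sv.sum(axis=1) + xgb_ev)
assert np.allclose(xgb_probs, xgb.predict_proba(X_test)[:, 1])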
To relate RF SHAP values to the predicted probability for the 0th data point (no sigmoid needed, since these are already probabilities):
rf_pred = rf_sv[1,0,:].sum() + rf_ev[1]
assert np.isclose(rf_pred, rf.predict_proba(X_test)[0,1])
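And vectorized over the test set, under the same assumptions:

# RF SHAP values are already in probability space, so just sum and shift
rf_probs = rf_sv[1].sum(axis=1) + rf_ev[1]
assert np.allclose(rf_probs, rf.predict_proba(X_test)[:, 1])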
Note that this analysis applies to (i) the sklearn API and (ii) binary classification.
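If you'd rather have the XGB attributions in probability space from the start, TreeExplainer has a model_output="probability" mode; a hedged sketch (it needs background data and the interventional perturbation mode, and exact behavior can vary across shap versions):

# Probability-space SHAP values for XGB (requires background data;
# support depends on your shap version)
xgb_exp_prob = TreeExplainer(
    xgb, data=X_train, model_output="probability",
    feature_perturbation="interventional",
)
xgb_sv_prob = np.array(xgb_exp_prob.shap_values(X_test))
# Now xgb_sv_prob[0, :].sum() + xgb_exp_prob.expected_value should be close
# to xgb.predict_proba(X_test)[0, 1] without applying expit.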