SHAP not working with LightGBM categorical features
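My model uses LGBMClassifier. I want to use SHAP (Shapley values) to explain the features, but SHAP gives me an error on categorical features. For example, I have a feature "Smoker" whose values are "Yes" and "No". I get this error from SHAP:
ValueError: could not convert string to float: 'Yes'.
Am I missing any settings?
By the way, I know I could use one-hot encoding to convert the categorical features, but I don't want to, since LGBMClassifier can handle categorical features without one-hot encoding.
Here is the sample code (shap version 0.40.0, lightgbm version 3.3.2):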
import pandas as pd
from lightgbm import LGBMClassifier #My version is 3.3.2
import shap #My version is 0.40.0
#The training data
X_train = pd.DataFrame()
X_train["Age"] = [50, 20, 60, 30]
X_train["Smoker"] = ["Yes", "No", "No", "Yes"]
#Target: whether the person had a certain disease
y_train = [1, 0, 0, 0]
#I did convert categorical features to the Category data type.
X_train["Smoker"] = X_train["Smoker"].astype("category")
#The test data
X_test = pd.DataFrame()
X_test["Age"] = [50]
X_test["Smoker"] = ["Yes"]
X_test["Smoker"] = X_test["Smoker"].astype("category")
#the classifier
clf = LGBMClassifier()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
#shap
explainer = shap.TreeExplainer(clf)
#I see this setting from google search but it did not really help
explainer.model.original_model.params = {"categorical_feature":["Smoker"]}
shap_values = explainer(X_train) #the error came out here: ValueError: could not convert string to float: 'Yes'
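Let's try it slightly differently, calling shap_values on the explainer instead of calling the explainer object itself: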
import pandas as pd
from lightgbm import LGBMClassifier
import shap
X_train = pd.DataFrame({
"Age": [50, 20, 60, 30],
"Smoker": ["Yes", "No", "No", "Yes"]}
)
X_train["Smoker"] = X_train["Smoker"].astype("category")
y_train = [1, 0, 0, 0]
X_test = pd.DataFrame({"Age": [50], "Smoker": ["Yes"]})
X_test["Smoker"] = X_test["Smoker"].astype("category")
clf = LGBMClassifier(verbose=-1).fit(X_train, y_train)
predicted = clf.predict(X_test)
print("Predictions:", predicted)
exp = shap.TreeExplainer(clf)
sv = exp.shap_values(X_train) # <-- here
print(f"Expected values: {exp.expected_value}")
print(f"SHAP values for 0th data point: {sv[1][0]}")
Predictions: [0]
Expected values: [1.0986122886681098, -1.0986122886681098]
SHAP values for 0th data point: [0. 0.]
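Note that you don't need to modify explainer.model.original_model.params. That attribute only gives you unintended public access to the model's parameters, which were already set for you when the model was trained.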
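One plausible reading of the original error, offered as an assumption rather than a confirmed account of SHAP's internals: if the DataFrame is turned into a plain NumPy array somewhere on the explainer(X_train) path, the category dtype is lost and only the raw "Yes"/"No" strings remain. The sketch below uses only pandas and NumPy to show that conversion failing with the same message. It does not call SHAP or LightGBM.

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({
    "Age": [50, 20, 60, 30],
    "Smoker": ["Yes", "No", "No", "Yes"],
})
X_train["Smoker"] = X_train["Smoker"].astype("category")

# The DataFrame itself still knows that "Smoker" is categorical.
print(X_train.dtypes)

# A plain NumPy array of the same data has no categorical dtype;
# mixing an integer column with string labels yields an object array.
raw = X_train.values
print(raw.dtype)

# Assumption: a float conversion like this is what fails inside the
# explainer call. Here it is reproduced with NumPy alone.
try:
    raw.astype(float)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# The categorical column stores integer codes next to its labels
# (categories are sorted, so "No" -> 0 and "Yes" -> 1).
codes = X_train["Smoker"].cat.codes
print(list(codes))
```

If that assumption holds, it would also explain why passing the DataFrame to exp.shap_values works in the answer above: the categorical information stays attached to the data that reaches the model.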