How to find which features are responsible for predicted label?

Question

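I am working on a machine learning project and want to know how to find the features that are most responsible for a predicted label, using sklearn in Python.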

Suppose we fit a model and call model.predict([1,2,3]), and let's suppose it says you passed the test. What are the feature weights for this one prediction only, model.predict([1,2,3])?

Suppose a dataset has 5 columns. Let's call them: id, X_1, X_2, X_3, result. X_1, X_2 and X_3 take values from 1 to 5.

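I need to show that this result was caused by X_1 and X_2 with weights of, say, 0.8900% and 0.3900%, or with any chart I can fully understand. How can I show that X_1 and X_2 influenced the result more than X_3, for this one prediction model.predict([1,2,3]) only?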

I have looked everywhere but did not find any code. I need a simple answer or any code that can help me solve this.

Answer 1

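Well, that really depends on your data, your model and what you want to achieve. That said, the simplest approach is to run different experiments and compare the results: build one model with X_1, X_2 and X_3, then build another model with only X_1 and X_2.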
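The experiment described above could be sketched roughly like this. It is only a minimal illustration: the data is synthetic (make_regression stands in for the X_1, X_2, X_3 columns), a regression target is assumed, and the subset names and the choice of 5-fold cross-validated R^2 as the comparison metric are my own assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the X_1, X_2, X_3 columns (assumption: regression target)
X, y = make_regression(n_samples=300, n_features=3, n_informative=2, random_state=42)

# Hypothetical feature subsets to compare; column indices 0, 1, 2 stand for X_1, X_2, X_3
subsets = {
    "X_1, X_2, X_3": [0, 1, 2],
    "X_1, X_2": [0, 1],
}

scores = {}
for name, cols in subsets.items():
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    # Mean R^2 over 5 cross-validation folds, using only the selected columns
    scores[name] = cross_val_score(model, X[:, cols], y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

If the score barely changes when a column is left out, that column probably adds little to the model. Note that this says something about the model as a whole, not about one individual prediction.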

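A more sophisticated solution is to use feature selection. A short introduction can be found here. For example, you can use feature importance to see how much each feature contributes to the prediction. An easy example with code can be found here.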

**Example with a random forest model:**
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# define dataset: 3 features, of which 2 are informative
X, y = make_regression(n_samples=1000, n_features=3, n_informative=2, random_state=42)
# define the model (fixed seed so the scores are reproducible)
model = RandomForestRegressor(random_state=42)
# fit the model
model.fit(X, y)
# get the impurity-based importance of each feature (the scores sum to 1)
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: X_%d, Score: %.5f' % (i + 1, v))

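In the output we can see that X_3 contributes more to the prediction than X_1, so one idea (at least if we suspected this from the start) would be to build another model with only X_1 and X_2. We could also consider excluding X_1, since it does not contribute much to the prediction, if we are worried about the dimensionality of our data.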

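Keep in mind that this is not the only approach, just one of many. It really depends on what data you have, which models you are using and what you are trying to do.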
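As one example of such an alternative, here is a rough sketch using sklearn's permutation_importance, which is a different technique from the impurity-based feature_importances_ attribute. It shuffles one column at a time on held-out data and measures how much the score drops. The synthetic data, the train/test split and n_repeats=10 are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split so importance is measured on unseen rows
X, y = make_regression(n_samples=500, n_features=3, n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out data in turn and record the drop in R^2
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"Feature: X_{i + 1}, Score: {mean:.5f} +/- {std:.5f}")
```

Like feature_importances_, this is a global measure over many rows; it does not explain a single prediction.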

Edit: Since you are now asking about a single prediction: you can use LIME to shed some light on how different features influence your predictions. As I don't know your code, I can't really provide correct code for your case. For an implementation you can look here or simply search online. Example code could look like this:

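import lime
import lime.lime_tabular

# X and model are the data array and fitted regressor from the random forest example above
# LIME has one explainer for all the models
explainer = lime.lime_tabular.LimeTabularExplainer(X, verbose=True, mode='regression')

# Choose the row at index 5 and explain the model's prediction for it
j = 5
exp = explainer.explain_instance(X[j], model.predict, num_features=3)
# Show the explanation (renders inside a Jupyter notebook)
exp.show_in_notebook(show_table=True)
# Outside a notebook, print the (feature, weight) pairs instead
print(exp.as_list())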

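In the resulting output, the interpretation could be that feature 0 and feature 2 contribute most strongly to the prediction, and that feature 2 may be pushing the prediction in a more negative direction.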
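If a plain linear model is good enough for your data, you may not need LIME at all, because the prediction already splits exactly into one term per feature. The sketch below is not LIME but a simple decomposition that only holds for linear models; the synthetic data and the choice of the row at index 5 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=500, n_features=3, n_informative=2,
                       noise=5.0, random_state=42)
model = LinearRegression().fit(X, y)

x = X[5]                            # the single row being explained
baseline = model.predict(X).mean()  # average prediction over the training data

# For a linear model, each feature's contribution relative to the baseline
# is its coefficient times (feature value - training mean of that feature)
contributions = model.coef_ * (x - X.mean(axis=0))

for i, c in enumerate(contributions):
    print(f"X_{i + 1}: {c:+.4f}")

# The baseline plus all contributions reproduces the prediction for this row
prediction = model.predict(x.reshape(1, -1))[0]
print(f"baseline {baseline:.4f} + contributions {contributions.sum():.4f} "
      f"= prediction {prediction:.4f}")
```

A positive contribution pushes this one prediction above the average, a negative one pushes it below. For non-linear models such as a random forest this decomposition does not apply, which is where a tool like LIME comes in.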
