Multi-label feature selection using sklearn

Question

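I'm looking to perform feature selection on a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.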

from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# One chi2 selector + linear SVM per label, wrapped in one-vs-rest
clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

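I then plan to extract the indices of the features selected for each label, using this:

# multi_clf must already be fitted, otherwise estimators_ does not exist
selected_features = []
for i in multi_clf.estimators_:
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))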

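Now, my question is: how do I choose which of the selected features to include in my final model? I could use every unique feature (which would include features that were relevant for only one label), or I could do something to select features that were relevant for more labels.

My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way to perform feature selection for multi-label datasets using sklearn?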
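For reference, the two options described above (the union of all selected features, or only features selected for several labels) can be sketched as follows. The per-label index lists, the feature count, and the cutoff of 2 labels are made-up values for illustration; in practice the lists would come from `get_support(indices=True)` on each fitted per-label selector.

```python
import numpy as np

# Hypothetical per-label selections (one list of feature indices per label).
per_label_indices = [[0, 2, 5], [2, 3, 5], [2, 5, 7]]
n_features = 8  # assumed total number of features

# counts[j] = number of labels for which feature j was selected
counts = np.bincount(np.concatenate(per_label_indices), minlength=n_features)

# Option 1: every feature selected for at least one label
union = np.flatnonzero(counts > 0)

# Option 2: only features selected for at least `min_labels` labels
min_labels = 2  # illustrative cutoff, not a recommended value
shared = np.flatnonzero(counts >= min_labels)
```

The `counts` array is exactly the data the histogram would be drawn from, so the cutoff can be inspected or tuned without re-running the selectors.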

Answer 1

http://scikit-learn.org/stable/modules/feature_selection.html

There are many options, but SelectKBest and recursive feature elimination (RFE) are two fairly popular ones.

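RFE works by leaving the least informative features out of the model, retraining, and repeating, so that the features left at the end are the ones the model relies on most for its predictions.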
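A minimal sketch of RFE on a single binary target, assuming synthetic data from `make_classification` and a `LogisticRegression` base estimator (both are illustrative choices, not part of the question). For a multi-label problem, one possibility would be to run something like this once per label.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic single-label data, used only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Drop one feature per iteration (step=1) until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)
rfe.fit(X, y)

kept = rfe.get_support(indices=True)  # indices of the surviving features
ranking = rfe.ranking_                # 1 = kept; larger = eliminated earlier
```

The value of `n_features_to_select` is itself a tuning choice; `RFECV` in the same module picks it by cross-validation.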

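What is best depends heavily on your data and use case.

Aside from what can loosely be described as cross-validation approaches to feature selection, you can also look at Bayesian model selection, a more theoretical approach that tends to favor simpler models over complex ones.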

Answer 2

According to the conclusions in this paper:

[...] rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using less features.

Then, to select a good subset of features, you just need to do (something like) this:

import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

# One row of chi-squared scores per label
scores = []
for label in labels:  # labels: presumably the column names of Y
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    scores.append(selector.scores_)

# MeanCS: keep features whose mean score across labels exceeds the threshold
mean_cs_mask = np.mean(scores, axis=0) > threshold
# MaxCS: keep features whose maximum score across labels exceeds the threshold
max_cs_mask = np.max(scores, axis=0) > threshold

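Note: in the code above, I assume that X is the output of some text vectorizer (the vectorized version of your texts) and Y is a pandas DataFrame with one column per label (so I can select the column Y[label]). There is also a threshold variable that should be fixed beforehand.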
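If fixing a threshold in advance is awkward, one possible variation is to keep the top k features by mean or maximum score instead. The sketch below assumes synthetic data from `make_multilabel_classification` (non-negative counts, as `chi2` requires), a NumPy indicator matrix for Y rather than a DataFrame, and an arbitrary k of 10. It calls `chi2` directly, which should give the same scores as `SelectKBest(chi2, k='all').scores_`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2

# Synthetic multi-label data: X holds non-negative counts, Y is a 0/1 matrix
X, Y = make_multilabel_classification(n_samples=200, n_features=30,
                                      n_classes=4, random_state=0)

# scores[j, f] = chi-squared score of feature f for label j
scores = np.array([chi2(X, Y[:, j])[0] for j in range(Y.shape[1])])
scores = np.nan_to_num(scores)  # guard against NaN from all-zero features

mean_cs = scores.mean(axis=0)  # MeanCS ranking criterion
max_cs = scores.max(axis=0)    # MaxCS ranking criterion

k = 10  # illustrative number of features to keep
top_mean = np.argsort(mean_cs)[::-1][:k]
top_max = np.argsort(max_cs)[::-1][:k]

# Reduced matrix to hand to another ML package (columns in original order)
X_reduced = X[:, np.sort(top_mean)]
```

Whether MeanCS or MaxCS works better, and what k should be, would normally be settled by validating the downstream classifier.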


machine-learning

feature-selection

python-2.7

scikit-learn

multilabel-classification