Top 10 features for an SVC with RBF kernel

Question

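I'm trying to get the top 10 most informative (best) features for an SVM classifier with an RBF kernel. Since I'm a beginner at programming, I tried some code I found online. Unfortunately, none of it works. I always get the error: ValueError: coef_ is only available when using a linear kernel.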

This is the last code I tested:

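from sklearn import model_selection
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# `instances` and `labels` are my data, loaded elsewhere
enc = LabelEncoder()
y = enc.fit_transform(labels)
vec = DictVectorizer()

feat_sel = SelectKBest(mutual_info_classif, k=200)

# Pipeline for SVM classifier (SVC uses kernel='rbf' by default)
clf = SVC()
pipe = Pipeline([('vectorizer', vec),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('svc', clf)])

y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)

# Now fit the pipeline using your data
pipe.fit(instances, y)

def show_most_informative_features(vec, clf, n=10):
    # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
    feature_names = vec.get_feature_names_out()
    # coef_ exists only for a linear kernel, so this line raises the error quoted above
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    # Collect every row; a return inside the loop would stop after the first pair
    rows = ['\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2)
            for (coef_1, fn_1), (coef_2, fn_2) in top]
    return '\n'.join(rows)

print(show_most_informative_features(vec, clf))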

Does anyone know a way to get the top 10 features from a classifier with an RBF kernel? Or another way to visualize the best features?

Answer 1

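I'm not sure that what you're asking is possible for an RBF kernel in the same way as in your example (as your error suggests, it only works with a linear kernel).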

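However, you can always try feature ablation: remove each feature one at a time and test how that affects performance. The 10 features whose removal hurts performance the most are your "top 10 features".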

Obviously, this is only feasible if (1) you have relatively few features and/or (2) training and testing your model doesn't take long.
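A minimal sketch of the ablation loop described above. The synthetic data from make_classification, the feature names f0..f7, and the helper name ablation_ranking are assumptions for illustration; substitute your own X, y and feature names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def ablation_ranking(X, y, feature_names, cv=5):
    """Rank features by the drop in CV accuracy when each one is removed."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    drops = []
    for j, name in enumerate(feature_names):
        # Retrain and score without column j
        X_without = np.delete(X, j, axis=1)
        score = cross_val_score(model, X_without, y, cv=cv).mean()
        drops.append((baseline - score, name))
    # Largest drop first: removing that feature hurt the most
    return baseline, sorted(drops, reverse=True)


# Synthetic stand-in data (an assumption, not the asker's dataset)
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"f{j}" for j in range(X.shape[1])]

baseline, ranking = ablation_ranking(X, y, names)
print(f"baseline accuracy: {baseline:.4f}")
for drop, name in ranking[:10]:
    print(f"{name}\t{drop:+.4f}")
```

This fits the model (number of features) x cv times, which is why it only scales to small feature sets. A negative drop means the score went up when that feature was removed.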

Answer 2

There are at least two options in the scikit-learn Python module for feature selection with an SVM classifier that uses an RBF kernel.

For univariate feature selection, you can use SelectKBest. For a classification task, you can pass one of the following scoring functions to SelectKBest:

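chi-squared (chi2), for non-negative features
ANOVA F-value (f_classif), which assumes a linear dependency
Mutual information (mutual_info_classif), which captures general dependencies but needs more samples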
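As a rough sketch of how the univariate scores can be read back after fitting: the step names follow the question's pipeline, while the synthetic dense data, the feature names f0..f19, and the fixed random_state are assumptions for illustration.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the asker's dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"f{j}" for j in range(X.shape[1])])

# Fix the seed so the mutual information estimate is repeatable
score_func = partial(mutual_info_classif, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mutual_info", SelectKBest(score_func, k=10)),
    ("svc", SVC(kernel="rbf")),
])
pipe.fit(X, y)

# scores_ holds one score per input feature; higher means more informative
selector = pipe.named_steps["mutual_info"]
order = np.argsort(selector.scores_)[::-1][:10]
for j in order:
    print(f"{feature_names[j]}\t{selector.scores_[j]:.4f}")
```

Note that these scores are computed per feature, independently of the SVC, so they rank features by their individual relationship with the labels rather than by what the RBF model actually uses.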

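For sparse datasets, SequentialFeatureSelector can add or remove features one at a time (forward or backward selection) based on the classifier's cross-validation score. SFS does not require the classifier to expose a "coef_" or "feature_importances_" attribute, but it can be slow: a step that evaluates m candidate features with k-fold cross-validation needs m*k fits.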
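A small sketch of forward selection with an RBF SVC. The synthetic data, the feature names f0..f7, and the choice of 3 features are assumptions for illustration; SequentialFeatureSelector needs scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the asker's dataset)
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"f{j}" for j in range(X.shape[1])])

# Greedily add the feature that most improves the 5-fold CV score
sfs = SequentialFeatureSelector(SVC(kernel="rbf"),
                                n_features_to_select=3,
                                direction="forward",
                                cv=5)
sfs.fit(X, y)

# Boolean mask over the input features -> names of the chosen subset
selected = feature_names[sfs.get_support()]
print(selected)
```

The result is a selected subset rather than a ranking, so for a "top 10" you would set n_features_to_select=10 and accept that the order within the subset is not reported.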

Liu et al. also propose a recursive feature elimination (RFE) feature selection method in "Feature selection for support vector machines with RBF kernel".
