How to get the selected features after LinearSVC reduction in scikit

The title says it all. I have looked at the scikit docs, which are very poor for this particular task, and I have checked several online resources, including this post.

However, they seem to be wrong. For feature selection, we can do this:

clf = LinearSVC(penalty="l1", dual=False, random_state=0)
X_reduced = clf.fit_transform(X_full, y_full)

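Now, if we check the shape of X_reduced, it is clear how many features were selected. So the question is: which ones?

The coef_ attribute of LinearSVC is important here, and the usual suggestion is to iterate over it and keep the features whose coef_ is nonzero. Well, that is wrong, although it gets you very close to the real result.

After checking X_reduced, I can see that 310 features were selected. That much is certain, because I am looking at the resulting matrix. If I use the coef_ approach instead, I get 414 features out of 2000 in total, which is close to the real number but not equal to it.

According to the scikit LinearSVC docs, threshold=None involves mean(X), but I am stuck and do not know what to do next.

UPDATE: Here is a link with the data and code needed to reproduce the error. It is only a few KB.

I think LinearSVC() does return the features with nonzero coefficients. Could you upload a sample data file and a script that reproduce the inconsistency you are seeing (for example, via a Dropbox share link)?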

from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC
import numpy as np

X, y = load_svmlight_file("/home/Jian/Downloads/errorScikit/weirdData")

transformer = LinearSVC(penalty='l1', dual=False, random_state=0)
transformer.fit(X, y)
# set the threshold to machine epsilon so that every nonzero coefficient is kept
X_reduced = transformer.transform(X, threshold=np.finfo(float).eps)

print(str(X_reduced.shape[1]) + " is NOW equal to " + str((transformer.coef_ != 0).sum()))

414 is NOW equal to 414
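
To illustrate why a machine-epsilon threshold reproduces the nonzero count while a larger cutoff does not, here is a minimal sketch on a made-up coefficient vector. The values in `coef` and the 1e-5 cutoff are assumptions chosen for illustration, not taken from the data above.

```python
import numpy as np

# Hypothetical coefficients: two exact zeros, two tiny nonzero values, two large ones.
coef = np.array([0.0, 3e-7, -8e-6, 0.42, 0.0, -1.3])

eps = np.finfo(float).eps

# Indices of coefficients that are not exactly zero.
nonzero_idx = np.flatnonzero(coef != 0)

# Indices kept when the cutoff is machine epsilon.
eps_idx = np.flatnonzero(np.abs(coef) >= eps)

# Indices kept with a larger, assumed cutoff of 1e-5.
cutoff_idx = np.flatnonzero(np.abs(coef) >= 1e-5)

print(nonzero_idx, eps_idx, cutoff_idx)
```

The tiny but nonzero entries pass the first two checks and fail the third, which is the kind of gap seen between 414 and 310.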


# as suggested by user3914041, if you want both sides to be 310
transformer.transform(X).shape

Out[46]: (62, 310)

(abs(transformer.coef_) > 1e-5).sum()

Out[47]: 310
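
In newer scikit-learn releases, calling transform directly on an estimator is no longer available and SelectFromModel takes over that role. Below is a minimal sketch of how the selected column indices could be read back. It assumes synthetic data from make_classification as a stand-in for the original file, so the shapes and parameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the original data (assumed shape: 62 samples, 2000 features).
X, y = make_classification(
    n_samples=62, n_features=2000, n_informative=10, random_state=0
)

# Fit the L1-penalised model inside the selector; the threshold is left at its default.
selector = SelectFromModel(LinearSVC(penalty="l1", dual=False, random_state=0))
selector.fit(X, y)

# Column indices of the features that the selector keeps.
selected = selector.get_support(indices=True)
X_reduced = selector.transform(X)

# Manual cross-check: compare absolute coefficients against the threshold actually used.
importance = np.abs(selector.estimator_.coef_).sum(axis=0)
manual = np.flatnonzero(importance >= selector.threshold_)

print(X_reduced.shape[1], len(selected), len(manual))
```

Here `selected` holds the column indices of X that survive the reduction, which is the "which ones" the question asks about.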