Getting scores too high when using SelectPercentile (from sklearn) and SVM as classifier

Question

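In Python, I applied SelectPercentile (from sklearn) to keep only the most relevant features, and then trained an SVM classifier. I should mention that I have only one corpus, so I have to run cross-validation on that same corpus.
After selecting features with SelectPercentile, the cross-validation scores I get are too high. I think I am doing something wrong, but I can't figure out what. I suspected that the X_all matrix had duplicate rows or duplicate columns, but it does not.

I don't understand why I get this result. Can anyone explain what is happening under the hood and what I am doing wrong?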

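Implementation

# Extract only the words from the dataset
# using Pandas

Create the dataframe

The dataframe has the following structure:
- data: contains only words, with stop words removed
- gender: 1 or 0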

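# Imports assume scikit-learn 0.18 or later (sklearn.model_selection)
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split

# Build the TF-IDF matrix from the whole corpus
vectorizer = TfidfVectorizer(lowercase=False, min_df=1)
X_all = vectorizer.fit_transform(dataframe.data)
y_all = dataframe.gender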

selector = SelectPercentile(f_classif, percentile=10)

selector.fit(X_all, y_all)
X_all = selector.transform(X_all)

classifier = svm.SVC()

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

gs = GridSearchCV(classifier, param_grid, cv=5, n_jobs=4)
gs.fit(X_all.toarray(), y_all)
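# Rank the parameter settings by mean cross-validated score
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20)
sorted(zip(gs.cv_results_['mean_test_score'], gs.cv_results_['params']), key=lambda pair: pair[0], reverse=True)

print(gs.best_score_)
print(gs.best_params_)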

Scores obtained using all 150 samples:
without SelectPercentile: 0.756 (9704 features)

with percentile=90: 0.822 (8733 features)
with percentile=70: 0.947 (6792 features)
with percentile=50: 0.973 (4852 features)
with percentile=30: 0.967 (2911 features)
with percentile=10: 0.970 (971 features)
with percentile=3: 0.910 (292 features)
with percentile=1: 0.820 (98 features)

On the other hand, I tried a different approach and split my 150 samples into a training set and a test set as follows:

features_train, features_test, target_train, target_test = train_test_split(X_all, y_all, test_size=0.20, random_state=0)


selector = SelectPercentile(f_classif, percentile=10)
selector.fit(features_train, target_train)

features_train = selector.transform(features_train).toarray()
features_test = selector.transform(features_test).toarray()

classifier = svm.SVC().fit(features_train, target_train)
print("Training score: {0:.1f}%".format(classifier.score(features_test, target_test) * 100))

With this approach I get a warning:

"/usr/local/lib/python2.7/dist-packages/sklearn/feature_selection/univariate_selection.py:113: UserWarning: Features [0 0 0 ..., 0 0 0] are constant. UserWarning)"

and, whatever the percentile (10, 30, 50, ... 99), the result is always the same: 44.3%

Answer 1

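I don't think you should perform feature selection (SelectPercentile) using all of your data (X_all). By doing so, the data you test on during cross-validation has 'leaked' into your model. Your feature selection has seen the data in the test folds, so it hands your classifier a subset of features that correlates with the labels in both the training and the test sets.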
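To make the leak concrete, here is a minimal sketch on pure noise. This is synthetic data, not your corpus, and the array sizes and percentile=1 are arbitrary choices for illustration. Since the labels are random, an honest estimate should sit near chance, while selecting features on all rows before cross-validating typically reports a much higher score.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic noise (assumption for illustration): labels are unrelated to the features
X = rng.randn(150, 5000)
y = rng.randint(0, 2, size=150)

# Leaky: the selector sees every row, including the ones later used as test folds
X_leaky = SelectPercentile(f_classif, percentile=1).fit_transform(X, y)
leaky_score = cross_val_score(SVC(kernel="linear"), X_leaky, y, cv=5).mean()

# Honest: the selector is refit on the training part of each fold only
pipe = make_pipeline(SelectPercentile(f_classif, percentile=1), SVC(kernel="linear"))
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print("selection on all data: %.3f" % leaky_score)
print("selection inside CV:   %.3f" % honest_score)
```

The gap between the two numbers comes entirely from where the selector is fitted, because the classifier and the folds are the same in both runs.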

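You should use a Pipeline to chain the feature selection with your classifier, and run cross-validation on the whole pipeline for model evaluation.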
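A minimal sketch of that idea, assuming scikit-learn 0.18 or later. The toy corpus, the step names (tfidf, select, svm) and the grid values are placeholders I made up for illustration; in your case the inputs would be dataframe.data and dataframe.gender. Because the vectorizer and the selector are steps of the Pipeline, GridSearchCV refits them on the training part of each fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for dataframe.data / dataframe.gender
words_0 = ["red", "green", "blue", "cyan"]
words_1 = ["cat", "dog", "bird", "fish"]
shared = ["today", "really", "think", "people"]
texts, labels = [], []
for i in range(12):
    texts.append(" ".join([words_0[i % 4], words_0[(i + 1) % 4], shared[i % 4]]))
    labels.append(0)
    texts.append(" ".join([words_1[i % 4], words_1[(i + 1) % 4], shared[(i + 1) % 4]]))
    labels.append(1)

# Vectorizing and feature selection are pipeline steps, so they are fitted per fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=False, min_df=1)),
    ("select", SelectPercentile(f_classif)),
    ("svm", SVC()),
])

# The percentile is tuned together with the SVM parameters (values are illustrative)
param_grid = {
    "select__percentile": [25, 50, 100],
    "svm__kernel": ["linear"],
    "svm__C": [1, 10, 100, 1000],
}

gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(texts, labels)

print(gs.best_score_)
print(gs.best_params_)
```

Note that the raw texts go into gs.fit, not a precomputed X_all, so nothing derived from a test fold reaches the selector or the classifier.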

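That said, I think your approach of univariate feature selection followed by an SVM may perform worse than an SVD-SVM pipeline on text classification problems. Check out this answer for an example script.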
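The linked script is not reproduced here, so the following is only a rough sketch of what an SVD-SVM pipeline could look like. Using TruncatedSVD as the SVD step is my assumption, the toy corpus is made up, and n_components=3 is chosen only because the toy vocabulary is tiny; on a real corpus you would try a much larger value.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for dataframe.data / dataframe.gender
words_0 = ["red", "green", "blue", "cyan"]
words_1 = ["cat", "dog", "bird", "fish"]
shared = ["today", "really", "think", "people"]
texts, labels = [], []
for i in range(12):
    texts.append(" ".join([words_0[i % 4], words_0[(i + 1) % 4], shared[i % 4]]))
    labels.append(0)
    texts.append(" ".join([words_1[i % 4], words_1[(i + 1) % 4], shared[(i + 1) % 4]]))
    labels.append(1)

# TF-IDF, then a low-rank projection, then the SVM; all steps are refit per fold
svd_svm = make_pipeline(
    TfidfVectorizer(lowercase=False, min_df=1),
    TruncatedSVD(n_components=3, random_state=0),  # must be smaller than the vocabulary
    SVC(),
)

scores = cross_val_score(svd_svm, texts, labels, cv=5)
print(scores.mean())
```

As with the selection pipeline, the projection is learned inside each training fold, so the score is not inflated by the test data.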

