Model help using Scikit-learn when using GridSearch

Question

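As part of the Enron project, I built the model shown below. A summary of the steps follows the code.

The following model gives very high (near-perfect) scores: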

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # predict with the estimator exactly as the grid search returned it
    pred = gcv.best_estimator_.predict(x_test)

The following model gives more reasonable, but lower, scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # refit the best estimator on the training split only, then predict
    gcv.best_estimator_.fit(x_train, y_train)
    pred = gcv.best_estimator_.predict(x_test)

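Used SelectKBest to score and rank the features, and tried combinations of higher- and lower-scoring features.
Used GridSearch with a StratifiedShuffleSplit, with an SVM as the classifier.
Used best_estimator_ to predict and to compute precision and recall.

The problem is that the estimator gives perfect scores, in some cases exactly 1.

However, when I refit the best classifier on the training data and then run the test, it gives reasonable scores.

My doubt/question is what exactly GridSearch does with the test data after splitting with the shuffle split object we pass to it. I assumed it does not fit on the test data. If that is true, then predicting on the same test data should not give such high scores. Since I set a random_state value, the shuffle split should produce the same splits for the grid fit and for the prediction.

So, is it wrong to use the same shuffle split for both?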

Answer 1

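Basically, the grid search will:

Try every combination of your parameter grid.
For each of them, run a cross-validation (K-fold by default, or whatever splitter you pass as cv).
Select the best one.

So your second case is the good one. Otherwise you are actually predicting on data that you trained on (which is not the case in the second option, where you only keep the best parameters from the grid search).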
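The three steps above can be sketched by hand, which may make it clearer what the search does and does not do. This is only an illustrative sketch, not scikit-learn's actual implementation: the synthetic data from make_classification and the small parameter grid are assumptions standing in for the Enron features and the asker's clf_params.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    ParameterGrid,
    StratifiedShuffleSplit,
    cross_val_score,
)
from sklearn.svm import SVC

# Synthetic stand-in data (assumption; the question uses the Enron dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # hypothetical grid

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):          # 1. try every combination
    model = SVC(**params)
    scores = cross_val_score(model, X, y, cv=cv)  # 2. cross-validate it
    if scores.mean() > best_score:                # 3. keep the best one
        best_score, best_params = scores.mean(), params

print(best_params, round(best_score, 3))
```

Note that inside the loop each model is only ever scored on rows it was not fit on. The loop returns parameters, not a model that has been kept away from the test rows.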

Answer 2

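As @Gauthier Feuillen said, GridSearchCV is used to search for the best parameters of an estimator for the given data. Here is a description of what GridSearchCV does:

1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
2. gcv.fit(features, labels)
3. clf_params is expanded into all possible parameter combinations using ParameterGrid.
4. features is split into features_train and features_test using cv. The same is done for labels.
5. The grid search estimator (pipe) is trained on features_train and labels_train and scored on features_test and labels_test.
6. For each possible parameter combination from step 3, steps 4 and 5 are repeated for every cv iteration. The scores are averaged across the cv iterations, and that average is assigned to the parameter combination. These values can be accessed through the cv_results_ attribute of the grid search.
7. For the parameters that give the best score, the internal estimator is re-initialized with those parameters and refit on the whole data supplied to it (features and labels).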
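The averaged scores from step 6 and the refit from step 7 can be inspected on a fitted search. The following is a minimal sketch under stated assumptions: the data is synthetic, and the pipeline (a StandardScaler followed by an SVC) and the one-parameter grid are hypothetical stand-ins for the asker's pipe and clf_params.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Enron features and labels (assumption).
features, labels = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline and grid; the original pipe and clf_params are not shown.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
clf_params = {"svm__C": [0.1, 1, 10]}
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)

# Step 6: one averaged score per parameter combination.
for params, mean in zip(gcv.cv_results_["params"], gcv.cv_results_["mean_test_score"]):
    print(params, round(mean, 3))

# best_score_ is the cross-validated mean of the winning combination.
best_cv_score = gcv.cv_results_["mean_test_score"][gcv.best_index_]

# Step 7: best_estimator_ was refit on all of `features` (refit=True is the
# default), so scoring it on that same data gives a training score.
train_score = gcv.best_estimator_.score(features, labels)
print(best_cv_score, train_score)
```

The number to report as an estimate of generalization is best_cv_score (that is, gcv.best_score_), not a score computed by calling best_estimator_ on rows from the data it was refit on.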

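Because of the last step, you get different scores in the first and second approaches. In the first approach, all of the data is used for training, and you then predict on that same data. The second approach predicts on data the model has not seen before.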
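The difference between the two approaches can be reproduced side by side. This is a hedged sketch rather than the asker's exact setup: the data is synthetic, the SVC grid is hypothetical, and accuracy (the default score) is used in place of precision and recall. It also uses sklearn.base.clone to get a fresh, unfitted copy for the second approach instead of refitting best_estimator_ in place, which is a choice made here so that the first approach is not overwritten inside the loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

# Noisy synthetic data (assumption; stands in for the Enron dataset).
features, labels = make_classification(
    n_samples=200, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=42)
grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}  # hypothetical grid
gcv = GridSearchCV(SVC(), grid, cv=cv)
gcv.fit(features, labels)

seen_scores, unseen_scores = [], []
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # Approach 1: best_estimator_ was already fit on ALL rows, x_test included.
    seen_scores.append(gcv.best_estimator_.score(x_test, y_test))

    # Approach 2: a fresh copy with the best parameters, fit on x_train only.
    model = clone(gcv.best_estimator_).fit(x_train, y_train)
    unseen_scores.append(model.score(x_test, y_test))

print("approach 1 (rows seen in training):", np.mean(seen_scores))
print("approach 2 (rows held out):        ", np.mean(unseen_scores))
print("gcv.best_score_:                   ", gcv.best_score_)
```

Because the splitter has a fixed random_state, the second approach repeats the same fits the grid search already ran for the winning parameters, so its mean should agree with gcv.best_score_. Reusing the same splitter is therefore not wrong in itself; the inflated scores come from scoring the refit estimator on rows it was trained on.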
