What's the difference between scoring in RFECV versus GridSearchCV?

Question

I am trying to run RFECV to select the best features and GridSearchCV to get the best hyperparameters. My code looks like this:

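from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# RFECV scores with recall; GridSearchCV is given no scoring argument.
params = {'estimator__C': [1e-4, 1e4]}
estimator = LogisticRegression(random_state=123)
selector = RFECV(estimator, step=1, cv=5, scoring='recall')
clf = GridSearchCV(selector, params, cv=5)
clf.fit(X_train, y_train)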

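When I include the same scoring metric in GridSearchCV, I get different best features, n_features, and parameters from cv_results. Why does this happen, and which of these approaches is correct?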

# Same setup, but GridSearchCV now scores with recall as well.
params = {'estimator__C': [1e-4, 1e4]}
estimator = LogisticRegression(random_state=123)
selector = RFECV(estimator, step=1, cv=5, scoring='recall')
clf = GridSearchCV(selector, params, cv=5, scoring='recall')
clf.fit(X_train, y_train)

Answer 1

Why is this happening

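In the first case, where you do not explicitly specify scoring in GridSearchCV, it uses the default scoring of the estimator involved, here LogisticRegression (the score method of RFECV delegates to the estimator it wraps); from the docs: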

scoring : string, callable, list/tuple, dict or None, default: None

[...]

If None, the estimator’s score method is used.

And what is the score of LogisticRegression? Again, from the docs:

score(self, X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

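So, in the first case, for the GridSearchCV part, you get the parameters that maximize accuracy, while in the second case you get the parameters that maximize recall. In principle, the parameters that maximize these two different metrics need not be the same (they can be, of course, but they may well not be, as here).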
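To check the default-scoring behaviour described above, here is a minimal sketch on synthetic data. The dataset from make_classification, the helper name run_search, the smaller cv=3, and max_iter=1000 are assumptions added for illustration; they are not part of the question. It runs the same search with no scoring, with scoring='accuracy', and with scoring='recall', so the cross-validated scores can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data (an assumption; substitute your own X_train, y_train).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           weights=[0.7, 0.3], random_state=0)

params = {"estimator__C": [1e-4, 1e4]}


def run_search(scoring):
    """Hypothetical helper: same pipeline as in the question, varying only GridSearchCV's scoring."""
    estimator = LogisticRegression(random_state=123, max_iter=1000)
    selector = RFECV(estimator, step=1, cv=3, scoring="recall")
    search = GridSearchCV(selector, params, cv=3, scoring=scoring)
    search.fit(X, y)
    return search


default_search = run_search(None)         # falls back to the estimator's score method
accuracy_search = run_search("accuracy")  # explicit accuracy
recall_search = run_search("recall")      # explicit recall

for name, search in [("default", default_search),
                     ("accuracy", accuracy_search),
                     ("recall", recall_search)]:
    print(name, search.cv_results_["mean_test_score"], search.best_params_)
```

If the default really is accuracy, the first two rows should show the same mean test scores, while the recall row is free to differ and may select a different C.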

which of these approaches is correct?

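Technically, both approaches are correct; the only one who can answer this question is you, and it depends on which metric is preferable for your business problem.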

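That said, the first approach does look a bit odd: why would you optimize for two different metrics during RFECV and GridSearchCV? At least in principle, it would make more sense to optimize everything against your chosen metric.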

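Again, keep in mind that all these techniques are in fact ad hoc approaches, without much theory behind them; the ultimate judge is the experiment. So, if you are interested in maximizing the accuracy of the final model, but you find that an intermediate RFECV stage that tries to maximize recall gives better overall accuracy in the end, you may very well just go for it...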
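One way such an experiment could look is sketched below, under stated assumptions: the synthetic data, the held-out split, cv=3, max_iter=1000, and the results dictionary are all illustrative choices, not something from the question. The outer GridSearchCV always targets accuracy, the RFECV metric is varied, and the final model is compared on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and split are assumptions; use your own data in practice.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

params = {"estimator__C": [1e-4, 1e4]}
results = {}

# Vary only the metric used inside RFECV; the final goal stays accuracy.
for rfecv_metric in ["accuracy", "recall"]:
    estimator = LogisticRegression(random_state=123, max_iter=1000)
    selector = RFECV(estimator, step=1, cv=3, scoring=rfecv_metric)
    search = GridSearchCV(selector, params, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    results[rfecv_metric] = {
        "best_params": search.best_params_,
        "n_features": search.best_estimator_.n_features_,
        "test_accuracy": accuracy_score(y_test, search.predict(X_test)),
    }

for metric, summary in results.items():
    print(metric, summary)
```

Whichever RFECV metric gives the better held-out accuracy on your own data is the one to keep; results on this synthetic set say nothing about yours.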

