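GridSearchCV with separate training & validation sets erroneously also takes the training results into account when finally choosing the best model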

Question

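I have a dataset of 3500 observations x 70 features, which is my training set, and a dataset of 600 observations x 70 features, which is my validation set. The goal is to classify each observation correctly as either 0 or 1.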

I use XGBoost, and my aim is to achieve the highest possible precision at a classification threshold of 0.5.

I am carrying out a grid search:

import numpy as np
import pandas as pd
import xgboost

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')
 
# Specify 'data_valid' as validation set for the Grid Search below
from sklearn.model_selection import PredefinedSplit
X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)

# Define my own scoring function to see
# if it is called for both the training and the validation sets
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True)

# Instantiate xgboost
from xgboost.sklearn import XGBClassifier
classifier = XGBClassifier(random_state=0)

# Small parameters' grid ONLY FOR START
# I plan to use way bigger parameters' grids 
parameters = {'n_estimators': [150, 175, 200]}

# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
                                   cv=train_valid_merge_indices, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)

................................................ .....................

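train_valid_merge - Specify my own validation set:

I want to train each model on my training set (data_train) and tune the hyperparameters on my distinct/separate validation set (data_valid). For this reason, I define a function called train_valid_merge which concatenates my training and validation sets so that they can be fed to GridSearchCV, and I also use PredefinedSplit to specify which part of this merged set is the training set and which is the validation set: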

def train_valid_merge(data_train, data_valid):

    # Set test_fold values to -1 for training observations
    train_indices = [-1]*len(data_train)

    # Set test_fold values to 0 for validation observations
    valid_indices = [0]*len(data_valid)

    # Concatenate the indices for the training and validation sets
    train_valid_indices = train_indices + valid_indices

    # Concatenate data_train & data_valid
    import pandas as pd
    data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y, train_valid_indices
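
To make the effect of these test_fold values concrete, here is a minimal sketch of what PredefinedSplit produces for such a list. The sizes (5 training rows, 3 validation rows) are hypothetical stand-ins for the 3500 and 600 rows above.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical toy sizes standing in for the 3500 training and 600 validation rows
n_train, n_valid = 5, 3

# -1 keeps a row out of every test fold; 0 assigns it to test fold 0
test_fold = [-1] * n_train + [0] * n_valid
splitter = PredefinedSplit(test_fold=test_fold)

# Collect every (train, test) index pair the splitter produces
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in splitter.split()]

print(splitter.get_n_splits())  # number of folds
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

There is exactly one split: the rows marked -1 are only ever used for fitting, and the rows marked 0 form the single test (validation) fold.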

................................................ .....................

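custom_scorer - Specify my own scoring metric:

I define my own scoring function, which simply returns the precision, just to see whether it is called for both the training and the validation sets: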

def my_precision(y_true, y_predict):

    # Check length of 'y_true' to see if it is the training or the validation set
    print(len(y_true))

    # Calculate precision
    from sklearn.metrics import precision_score
    precision = precision_score(y_true, y_predict, average='binary')

    return precision
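
As a reminder of what this function returns, precision for the positive class is TP / (TP + FP). The sketch below checks that by hand against precision_score; the label vectors are made up for illustration only.

```python
from sklearn.metrics import precision_score

# Hypothetical labels, chosen only to make the counts easy to follow
y_true = [0, 1, 1, 0, 1, 0]
y_predict = [1, 1, 0, 0, 1, 0]

# Count true positives and false positives by hand
tp = sum(1 for t, p in zip(y_true, y_predict) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_predict) if t == 0 and p == 1)
manual_precision = tp / (tp + fp)

# Same quantity computed by scikit-learn for the positive class (label 1)
library_precision = precision_score(y_true, y_predict, average='binary')

print(manual_precision, library_precision)
```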

................................................ .....................

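When I run the whole thing (for parameters = {'n_estimators': [150, 175, 200]}), print(len(y_true)) in the my_precision function prints lengths of both 3500 and 600, which means that the scoring function is called both for the training and the validation set. But I have tested that the scoring function is not only called: its results from both the training and validation sets are also used to determine the best model from the grid search (even though I have specified it to use only the validation set results).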

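For example, for our 3 parameter values ('n_estimators': [150, 175, 200]), it takes into account the scores of both the training and the validation set (2 sets), so it produces (3 parameters) x (2 sets) = 6 different grid results. It then picks the best hyperparameter set out of all these grid results, so it may end up picking one based on the training-set results, whereas I want only the validation set (3 results) to be taken into account.

However, if I add something like this to the my_precision function to circumvent the training set (by setting all its precision values to 0):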

# Remember that the training set has 3500 observations
# and the validation set 600 observations
if len(y_true) > 600:
    return 0

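then (as far as I have tested) I certainly get the best model according to my specification, because the training-set precision results are too small to be chosen, as they are all 0.

My questions are the following:

Why is the custom scoring function taking into account both the training and the validation set to pick out the best model, when I have specified with my train_valid_merge_indices that the best model for the grid search should be selected only according to the validation set?

How can I make GridSearchCV take into account only the validation set and the models' scores on it when the models are selected and ranked?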

Answer 1

I have one distinct training set and one distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on my distinct validation set.

Then you certainly need neither PredefinedSplit nor GridSearchCV:

import pandas as pd
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import precision_score

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')

# training data & labels:
X = data_train.iloc[:, :-1].values
y = data_train.iloc[:, -1].values   

# validation data & labels:
X_valid = data_valid.iloc[:, :-1].values
y_true = data_valid.iloc[:, -1].values 

n_estimators = [150, 175, 200]
perf = []

for k_estimators in n_estimators:
    clf = XGBClassifier(n_estimators=k_estimators, random_state=0)
    clf.fit(X, y)

    y_predict = clf.predict(X_valid)
    precision = precision_score(y_true, y_predict, average='binary')
    perf.append(precision)

and perf will contain the performance of your respective classifiers on your validation set...
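
For a larger grid, the same manual loop could be driven by scikit-learn's ParameterGrid, which expands a dictionary of lists into every combination. The following is only a sketch under stated assumptions: the data is synthetic, the grid values are hypothetical, and GradientBoostingClassifier stands in for XGBClassifier so that the example needs nothing beyond scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in data (assumption): first 300 rows train, last 100 rows validation
X_all, y_all = make_classification(n_samples=400, n_features=10, random_state=0)
X, y = X_all[:300], y_all[:300]
X_valid, y_true = X_all[300:], y_all[300:]

# Hypothetical grid; ParameterGrid expands it into every combination
grid = ParameterGrid({'n_estimators': [10, 20], 'max_depth': [2, 3]})

results = []
for params in grid:
    # Fit on the training set only
    clf = GradientBoostingClassifier(random_state=0, **params)
    clf.fit(X, y)

    # Score on the separate validation set only
    precision = precision_score(y_true, clf.predict(X_valid), average='binary')
    results.append((precision, params))

# Pick the combination with the highest validation precision
best_precision, best_params = max(results, key=lambda item: item[0])
print(best_params, best_precision)
```

Because the training set never reaches the scoring step here, only validation precision can influence which combination is picked.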

Answer 2

which means that the scoring function is called both for the training and the validation set...

This is probably true.

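...But I have tested that the scoring function is not only called but its results from both the training and validation sets are used to determine the best model from the grid search (even though I have specified it to use only the validation set results).

But this is probably not true.

There is a parameter return_train_score; when True, the training data is also scored and those scores are returned as part of the cv_results_ attribute. Before v0.21, this parameter defaulted to True; since then it defaults to False. However, these scores are not used to determine the best hyperparameters, unless you have a custom scoring method that takes them into account. (If you think you have a counterexample, please provide cv_results_ and best_params_.)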
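
One way to check this is to record the lengths the scoring function sees with return_train_score switched on and off, and to compare the selected parameters. The sketch below does that under stated assumptions: the data is synthetic, the grid is hypothetical, and DecisionTreeClassifier stands in for XGBClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

seen_lengths = []  # lengths of y_true passed to the scoring function

def my_precision(y_true, y_predict):
    seen_lengths.append(len(y_true))
    return precision_score(y_true, y_predict, average='binary', zero_division=0)

# Synthetic stand-in data (assumption): 300 training rows followed by 100 validation rows
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
cv = PredefinedSplit(test_fold=[-1] * 300 + [0] * 100)

def run_search(return_train_score):
    seen_lengths.clear()
    search = GridSearchCV(
        estimator=DecisionTreeClassifier(random_state=0),
        param_grid={'max_depth': [1, 2, 3]},  # hypothetical grid
        scoring=make_scorer(my_precision),
        cv=cv,
        refit=True,
        n_jobs=1,  # single process, so seen_lengths is filled in this process
        return_train_score=return_train_score,
    )
    search.fit(X, y)
    return search, sorted(seen_lengths)

search_with, lengths_with = run_search(return_train_score=True)
search_without, lengths_without = run_search(return_train_score=False)

print(lengths_with)     # validation and training lengths
print(lengths_without)  # validation lengths only

# The winner is the candidate ranked first on the validation (test) score
results = search_with.cv_results_
print(results['mean_test_score'], results['mean_train_score'])
print(search_with.best_params_, search_without.best_params_)
```

With return_train_score=True the scoring function is called on the training rows as well, but best_params_ follows rank_test_score in both runs, so the training scores are only reported and play no part in the selection.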

Why the custom scoring function is taking into account both the training and the validation set to pick out the best model while I have specified with my train_valid_merge_indices that the best model for the Grid Search should be only selected according to the validation set?

It (probably) isn't; see above.

How can I make the GridSearchCV to account only for the validation set and the score of the models at it when the selection and the ranking of the models will be done?

It does this by default.

Tags: python, machine-learning, scikit-learn, cross-validation, grid-search