What to do with the best_score from a grid search obtained from a nested cross-validation?

Question

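I optimized a RandomForest using GridSearch with nested cross-validation. I understand that, once I have the best parameters, I have to train on the whole dataset before making predictions on out-of-sample data.

Do I have to fit the model twice? Once to get the accuracy estimate from nested cross-validation, and then again before scoring the out-of-sample data?

Please check my code: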

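import numpy as np
import scipy.io as sio
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

#Load data
for name in ["AWA"]:
    for el in ['Fp1']:
        data = sio.loadmat('/home/TrainVal/{}_{}.mat'.format(name, el))
        X = data['x']
        y = np.ravel(data['y'])

        print(name, el, X.shape, y.shape)
        print("")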


#Pipeline
clf = Pipeline([('rcl', RobustScaler()),
                ('clf', RandomForestClassifier())])   

#Optimization
#Outer loop
sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
#Inner loop
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)


# Use a full grid over all parameters
param_grid = {'clf__n_estimators': [10, 12, 15],
              'clf__max_features': [3, 5, 10],
             }


# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=sss_inner, n_jobs=-1)
#FIRST FIT!!!!!
grid_search.fit(X, y)
scores=cross_val_score(grid_search, X, y, cv=sss_outer)

#Show best parameter in inner loop
print(grid_search.best_params_)

#Show Accuracy average of all the outer loops 
print(scores.mean())

#SECOND FIT!!!
# X_out_of_sample / y_out_of_sample stand for the held-out data and its labels
y_score = grid_search.fit(X, y).score(X_out_of_sample, y_out_of_sample)
print(y_score)

Answer 1

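Your grid_search.best_estimator_ already holds the fitted model with the best_params_ parameters (with the default refit=True, GridSearchCV refits it on all the data passed to fit), so there is no need to fit it again.

You can use: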

clf = grid_search.best_estimator_
preds = clf.predict(X_unseen)
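
To make this concrete, here is a minimal sketch on synthetic data. The dataset from make_classification, the small parameter grid, and the names X_train and X_unseen are assumptions chosen for illustration; they are not the asker's data. It fits the search once and then predicts with best_estimator_ without any further fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 15], "max_features": [3, 5]},
    cv=3,
)
grid_search.fit(X_train, y_train)  # single fit; refit=True is the default

# best_estimator_ has already been refit on all of X_train with best_params_
clf = grid_search.best_estimator_
preds = clf.predict(X_unseen)

# grid_search.predict delegates to best_estimator_, so both give the same result
same_as_search = (preds == grid_search.predict(X_unseen)).all()
print(grid_search.best_params_, same_as_search)
```

If refit were set to False, best_estimator_ would not be available and a manual fit with best_params_ would be needed.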

Answer 2

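There are a few things you need to understand.

When you do the "first fit", the grid_search model is fitted according to the sss_inner cv, and the result is stored in grid_search.best_estimator_ (i.e. the estimator that scored best on the test data of the sss_inner folds).

Next you use grid_search inside cross_val_score (the nested part). The model fitted in the "first fit" is of no use here. cross_val_score clones the estimator and calls grid_search.fit() on each fold of sss_outer (meaning the training data from sss_outer is handed to grid_search, which splits it again according to sss_inner) and reports the scores on the test data of sss_outer. The models fitted inside cross_val_score are not kept.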
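
The cloning behaviour described above can be observed with a small sketch. It uses cross_validate with return_estimator=True instead of cross_val_score, purely so the per-fold searches can be inspected; the synthetic data and the small grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedShuffleSplit,
    cross_validate,
)

# Synthetic stand-in for the real data (assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 15], "max_features": [3, 5]},
    cv=sss_inner,
)

# Nested CV: each outer split fits its own clone of grid_search
results = cross_validate(grid_search, X, y, cv=sss_outer, return_estimator=True)

for fold, fitted_search in enumerate(results["estimator"]):
    # Each outer fold may select different best_params_
    print(fold, fitted_search.best_params_, results["test_score"][fold])

# The grid_search object passed in was only cloned, never fitted
original_is_fitted = hasattr(grid_search, "best_estimator_")
print("original fitted:", original_is_fitted)
```

The mean of results["test_score"] is the nested estimate of generalization accuracy; it evaluates the whole tuning procedure rather than one specific set of parameters.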

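In your "second fit" you are fitting again exactly as in the "first fit". There is no need to do that, because grid_search is already fitted. Just call grid_search.score(); it will internally call score() on best_estimator_.

You can look at my answer here to learn more about nested cross-validation with grid search.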
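
As a rough check of that last point, the sketch below fits a search once and compares grid_search.score() with best_estimator_.score() on held-out data. The pipeline mirrors the one in the question, but the synthetic data, the split, and the names X_test and y_test are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the real data (assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("rcl", RobustScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
grid_search = GridSearchCV(
    pipe,
    param_grid={"clf__n_estimators": [10, 15], "clf__max_features": [3, 5]},
    cv=3,
)
grid_search.fit(X_train, y_train)  # the only fit that is needed

# With the default scoring=None, both calls use the estimator's own score method
score_via_search = grid_search.score(X_test, y_test)
score_via_best = grid_search.best_estimator_.score(X_test, y_test)
print(score_via_search, score_via_best)
```

Note that if a custom scoring argument is passed to GridSearchCV, grid_search.score() uses that scorer, so the two numbers can then differ.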

Answer 3

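It is just like any normal model you build. Once you have trained your model (via CV or a normal train-test split), you go on to predict by calling .score or .predict on the best_estimator_ from the grid search.

Sample code that I used recently: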

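import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# train_iv_data / train_dv_data and test_iv_data / test_dv_data are the
# training and test features / labels, prepared beforehand
bootstrap = [True, False]
max_features = [3, 4, 5, 'sqrt']  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
n_estimators = [20, 75, 100]

rf_model = RandomForestClassifier(random_state=1)
param_grid = dict(bootstrap=bootstrap, max_features=max_features, n_estimators=n_estimators)
grid = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=10, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(train_iv_data, train_dv_data)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str(time.time() - start_time) + ' s')  # time.time() is in seconds

# classification_report expects (y_true, y_pred)
print(classification_report(test_dv_data, grid_result.best_estimator_.predict(test_iv_data)))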


optimization

machine-learning

python-3.x

scikit-learn

cross-validation