What is the difference between grid.score(X_valid, y_valid) and grid.best_score_
When doing GridSearchCV, what is the difference between the scores obtained via grid.score(...) and grid.best_score_?
Please assume that the model, features, target, and param_grid are in place. Here is the part of the code I am really curious about:
grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                    return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
scores = grid.score(X_valid, y_valid)
best_score_1 = scores
best_score_2 = grid.best_score_
best_score_1 and best_score_2 produce two different outputs. I want to know the difference between them, and which of the two should be considered the best score for the given param_grid.
Here is the full function:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides a
    comparison between the train MSE and test MSE to check for
    generalization, and optionally deploys the best-found parameters
    on the test set as well.
    Args:
        df: DataFrame
        model: a model to use
        features: features to consider
        target: labels
        params: param_grid for optimization
        test: False by default; if True, predicts on the test set
    Returns:
        MSE scores on the models and a slice of cv_results_
        to compare the models' generalization performance
    '''
    my_model = model()
    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                        df[target], random_state=0)
    # Re-split the train set for GridSearchCV into train2 and valid
    # to keep the test set separate
    X_train2, X_valid, y_train2, y_valid = train_test_split(X_train,
                                                            y_train, random_state=0)
    # Use grid search to find the best parameters from the param_grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True, scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)
    # Evaluate on the valid set
    scores = grid.score(X_valid, y_valid)  # CONFUSION
    print('Best MSE through GridSearchCV: ', grid.best_score_)  # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-' * 120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)
    # If test is True, deploy the best_params_ on the test set
    if test:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)
        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-' * 120)
    return report[['mean_train_score', 'mean_test_score']]
Updated
As stated in the sklearn documentation, GridSearchCV takes the lists of values you pass for each parameter and tries every possible combination of them to find the best parameters.
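As a minimal sketch of what "every combination" means (the param_grid below is made up purely for illustration), sklearn's ParameterGrid enumerates the same combinations that GridSearchCV will fit:

from sklearn.model_selection import ParameterGrid

# Hypothetical param_grid, used only to illustrate the enumeration
params = {'max_depth': [3, 5], 'n_estimators': [50, 100]}
for combo in ParameterGrid(params):
    print(combo)
# {'max_depth': 3, 'n_estimators': 50}
# {'max_depth': 3, 'n_estimators': 100}
# {'max_depth': 5, 'n_estimators': 50}
# {'max_depth': 5, 'n_estimators': 100}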
To evaluate which parameters are best, it computes a k-fold cross-validation for each parameter combination. With k-fold cross-validation, the training set is divided into a training part and a validation part (which plays the role of a test set). If you choose cv=5, for example, the dataset is split into 5 non-overlapping folds, and each fold takes a turn as the validation set while all the remaining folds serve as the training set. GridSearchCV therefore computes the mean validation score (accuracy, or something else) across the 5 folds, and does so for every parameter combination. At the end of the search there is a mean validation score for each parameter combination, and the combination with the highest mean validation score wins. The mean validation score associated with those best parameters is what gets stored in the grid.best_score_ variable.
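Here is a small self-contained sketch of that, using toy data and a DecisionTreeRegressor chosen only for illustration; it checks that grid.best_score_ is exactly the winning entry of mean_test_score in cv_results_:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'max_depth': [2, 4, 8]},
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated score (negative MSE here) of the
# winning parameter combination, taken straight from cv_results_
assert grid.best_score_ == grid.cv_results_['mean_test_score'][grid.best_index_]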
The grid.score(X_valid, y_valid) method, on the other hand, gives the score on the data you pass it, provided the estimator has been refit (refit=True, the default). This means it is not the mean score over the 5 folds: the search takes the model with the best parameters and retrains it on the whole training set, then grid.score computes predictions on X_valid and compares them against y_valid to obtain the score.
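Continuing the toy example above, grid.score can be reproduced by hand from the refit best_estimator_ (with scoring='neg_mean_squared_error' the returned value is a negative MSE):

import numpy as np
from sklearn.metrics import mean_squared_error

# best_estimator_ is the best model refit on all of X_train (refit=True).
# grid.score applies the search's scorer to it on the held-out data:
manual = -mean_squared_error(y_valid, grid.best_estimator_.predict(X_valid))
assert np.isclose(grid.score(X_valid, y_valid), manual)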