Grid Search Returns Exactly the Same Result Given a Custom Model
I wrapped a Scikit-Learn random forest model in a class, as follows:
from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def score(self, X, y):
        from sklearn.metrics import mean_squared_error
        return mean_squared_error(y_true=y,
                                  y_pred=self.model.predict(X),
                                  squared=False)

    def predict(self, X):
        return self.model.predict(X)

class RandomForest(Model):
    def __init__(self, n_estimators=100,
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        from sklearn.ensemble import RandomForestRegressor
        self.model = RandomForestRegressor(n_estimators=self.n_estimators,
                                           max_depth=self.max_depth,
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf,
                                           max_features=self.max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
I mostly followed the official Scikit-Learn guide, which can be found at https://scikit-learn.org/stable/developers/develop.html
This is what my grid search looks like:
import pandas as pd
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=RandomForest(),
                           param_grid={'max_depth': [1, 3, 6], 'n_estimators': [10, 100, 300]},
                           n_jobs=-1,
                           scoring='neg_root_mean_squared_error',
                           cv=5, verbose=True).fit(X, y)
print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))
The grid search output and the printed grid_search.cv_results_ are as follows:
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
mean_fit_time std_fit_time mean_score_time std_score_time \
0 0.210918 0.002450 0.016754 0.000223
1 0.207049 0.001675 0.016579 0.000147
2 0.206495 0.002001 0.016598 0.000158
3 0.206799 0.002417 0.016740 0.000144
4 0.207534 0.001603 0.016668 0.000269
5 0.206384 0.001396 0.016605 0.000136
6 0.220052 0.024280 0.017247 0.001137
7 0.226838 0.027507 0.017351 0.000979
8 0.205738 0.003420 0.016246 0.000626
param_max_depth param_n_estimators params \
0 1 10 {'max_depth': 1, 'n_estimators': 10}
1 1 100 {'max_depth': 1, 'n_estimators': 100}
2 1 300 {'max_depth': 1, 'n_estimators': 300}
3 3 10 {'max_depth': 3, 'n_estimators': 10}
4 3 100 {'max_depth': 3, 'n_estimators': 100}
5 3 300 {'max_depth': 3, 'n_estimators': 300}
6 6 10 {'max_depth': 6, 'n_estimators': 10}
7 6 100 {'max_depth': 6, 'n_estimators': 100}
8 6 300 {'max_depth': 6, 'n_estimators': 300}
split0_test_score split1_test_score split2_test_score split3_test_score \
0 -5.246725 -3.200585 -3.326962 -3.209387
1 -5.246725 -3.200585 -3.326962 -3.209387
2 -5.246725 -3.200585 -3.326962 -3.209387
3 -5.246725 -3.200585 -3.326962 -3.209387
4 -5.246725 -3.200585 -3.326962 -3.209387
5 -5.246725 -3.200585 -3.326962 -3.209387
6 -5.246725 -3.200585 -3.326962 -3.209387
7 -5.246725 -3.200585 -3.326962 -3.209387
8 -5.246725 -3.200585 -3.326962 -3.209387
split4_test_score mean_test_score std_test_score rank_test_score
0 -2.911422 -3.579016 0.845021 1
1 -2.911422 -3.579016 0.845021 1
2 -2.911422 -3.579016 0.845021 1
3 -2.911422 -3.579016 0.845021 1
4 -2.911422 -3.579016 0.845021 1
5 -2.911422 -3.579016 0.845021 1
6 -2.911422 -3.579016 0.845021 1
7 -2.911422 -3.579016 0.845021 1
8 -2.911422 -3.579016 0.845021 1
[Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 3.2s finished
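My question is: why does the grid search return exactly the same result for every candidate on each data split?
My assumption is that the grid search seems to run only one parameter combination (e.g. {'max_depth': 1, 'n_estimators': 10}) on all data splits.
If so, why does that happen?
Finally, how can I make the grid search return the correct results for all data splits?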
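Your set_params method doesn't actually change the hyperparameters of the RandomForestRegressor instance held in the self.model attribute. Instead, it sets attributes directly on your RandomForest instance (which didn't exist before, and which don't affect the actual model!). So the grid search repeatedly sets these new parameters that don't matter, and the model that actually gets fitted is the same every time. (Similarly, the get_params method reads the attributes of RandomForest, which are not the same as the attributes of RandomForestRegressor.)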
You should be able to fix most of this by having set_params call self.model.set_params (and having get_params use self.model.<parameter_name> instead of self.<parameter_name>).
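The suggestion above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name DelegatingRandomForest, the PARAM_NAMES tuple, the synthetic data and the reduced grid are all assumptions made for the example. The mixin is listed before BaseEstimator, which is the order recent Scikit-Learn releases expect.

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameters this hypothetical wrapper exposes to the grid search.
PARAM_NAMES = ("n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "max_features")


class DelegatingRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper whose hyperparameters live only on self.model."""

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           min_samples_split=min_samples_split,
                                           min_samples_leaf=min_samples_leaf,
                                           max_features=max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        # Read the values from the wrapped model, not from the wrapper.
        inner = self.model.get_params(deep=False)
        return {name: inner[name] for name in PARAM_NAMES}

    def set_params(self, **parameters):
        # Forward the new values to the model that is actually fitted.
        self.model.set_params(**parameters)
        return self

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


# Synthetic data and a small grid, chosen only to keep the demo quick.
X, y = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=0)
search = GridSearchCV(estimator=DelegatingRandomForest(),
                      param_grid={"max_depth": [1, 6], "n_estimators": [5, 20]},
                      scoring="neg_root_mean_squared_error",
                      cv=3).fit(X, y)
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores)
```

With the parameters forwarded to the wrapped model, the candidates should no longer all report the same mean_test_score.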
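I think there's another issue, though I don't know how your example runs at all in that case: you instantiate the model attribute using self.<parameter_name>, but those never seemed to be defined in __init__.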
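A different way to avoid the question of where the hyperparameters live is to follow the convention from the Scikit-Learn developer guide that __init__ only stores its arguments, and to build the wrapped forest inside fit. The following is a sketch of that alternative, not what the answer above prescribes: the class name LazyRandomForest, the model_ attribute, the extra random_state argument and the synthetic data are assumptions made for illustration.

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


class LazyRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper that creates the forest only when fit is called."""

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None, random_state=777):
        # Only store the arguments; BaseEstimator derives get_params and
        # set_params from this signature, so neither needs overriding.
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        # Build the forest from the current parameter values, so anything
        # changed through set_params is picked up at fit time.
        self.model_ = RandomForestRegressor(**self.get_params())
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Synthetic data, used only to show that parameter changes reach the forest.
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)
shallow = LazyRandomForest(n_estimators=5, max_depth=1).fit(X, y)
deep = clone(shallow).set_params(max_depth=6).fit(X, y)
print(shallow.model_.max_depth, deep.model_.max_depth)  # prints: 1 6
```

Because the wrapper's attributes match its __init__ signature, clone and GridSearchCV can copy and reconfigure it without any custom parameter handling.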