Grid Search Returns Exactly the Same Result Given a Custom Model

Question

I wrapped Scikit-Learn's random forest model in a custom estimator class, as follows:

from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model
    
    def fit(self, X, y):
        self.model.fit(X, y)
        
        return self
    
    def score(self, X, y):
           
        from sklearn.metrics import mean_squared_error
        
        return mean_squared_error(y_true=y, 
                                  y_pred=self.model.predict(X), 
                                  squared=False)
    
    def predict(self, X):
        return self.model.predict(X)

class RandomForest(Model):
    def __init__(self, n_estimators=100, 
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        
        self.n_estimators=n_estimators 
        self.max_depth=max_depth
        self.min_samples_split=min_samples_split
        self.min_samples_leaf=min_samples_leaf
        self.max_features=max_features
           
        from sklearn.ensemble import RandomForestRegressor
 
        self.model = RandomForestRegressor(n_estimators=self.n_estimators, 
                                           max_depth=self.max_depth, 
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf, 
                                           max_features=self.max_features,
                                           random_state = 777)
    
    
    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

I mostly followed the official Scikit-Learn developer guide, which can be found at https://scikit-learn.org/stable/developers/develop.html

This is what my grid search looks like:

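import pandas as pd
from sklearn.model_selection import GridSearchCV

# X and y are the feature matrix and target used in the question (not shown).
grid_search = GridSearchCV(estimator=RandomForest(),
                           param_grid={'max_depth': [1, 3, 6], 'n_estimators': [10, 100, 300]},
                           n_jobs=-1,
                           scoring='neg_root_mean_squared_error',
                           cv=5, verbose=True).fit(X, y)

print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))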

The grid search output and the printed grid_search.cv_results_ are shown below:

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.210918      0.002450         0.016754        0.000223   
1       0.207049      0.001675         0.016579        0.000147   
2       0.206495      0.002001         0.016598        0.000158   
3       0.206799      0.002417         0.016740        0.000144   
4       0.207534      0.001603         0.016668        0.000269   
5       0.206384      0.001396         0.016605        0.000136   
6       0.220052      0.024280         0.017247        0.001137   
7       0.226838      0.027507         0.017351        0.000979   
8       0.205738      0.003420         0.016246        0.000626   

  param_max_depth param_n_estimators                                 params  \
0               1                 10   {'max_depth': 1, 'n_estimators': 10}   
1               1                100  {'max_depth': 1, 'n_estimators': 100}   
2               1                300  {'max_depth': 1, 'n_estimators': 300}   
3               3                 10   {'max_depth': 3, 'n_estimators': 10}   
4               3                100  {'max_depth': 3, 'n_estimators': 100}   
5               3                300  {'max_depth': 3, 'n_estimators': 300}   
6               6                 10   {'max_depth': 6, 'n_estimators': 10}   
7               6                100  {'max_depth': 6, 'n_estimators': 100}   
8               6                300  {'max_depth': 6, 'n_estimators': 300}   

   split0_test_score  split1_test_score  split2_test_score  split3_test_score  \
0          -5.246725          -3.200585          -3.326962          -3.209387   
1          -5.246725          -3.200585          -3.326962          -3.209387   
2          -5.246725          -3.200585          -3.326962          -3.209387   
3          -5.246725          -3.200585          -3.326962          -3.209387   
4          -5.246725          -3.200585          -3.326962          -3.209387   
5          -5.246725          -3.200585          -3.326962          -3.209387   
6          -5.246725          -3.200585          -3.326962          -3.209387   
7          -5.246725          -3.200585          -3.326962          -3.209387   
8          -5.246725          -3.200585          -3.326962          -3.209387   

   split4_test_score  mean_test_score  std_test_score  rank_test_score  
0          -2.911422        -3.579016        0.845021                1  
1          -2.911422        -3.579016        0.845021                1  
2          -2.911422        -3.579016        0.845021                1  
3          -2.911422        -3.579016        0.845021                1  
4          -2.911422        -3.579016        0.845021                1  
5          -2.911422        -3.579016        0.845021                1  
6          -2.911422        -3.579016        0.845021                1  
7          -2.911422        -3.579016        0.845021                1  
8          -2.911422        -3.579016        0.845021                1  
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.2s finished

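My question is: why does the grid search return exactly the same scores on every data split, no matter which parameter combination is being evaluated?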

My assumption is that the grid search only ever evaluates one parameter combination (for example {'max_depth': 1, 'n_estimators': 10}) on all data splits. If so, why does that happen?

Finally, how can I make the grid search return the correct results for all data splits?

Answer 1

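Your set_params method doesn't actually change the hyperparameters of the RandomForestRegressor instance stored in the self.model attribute. Instead, it sets attributes directly on your RandomForest instance, and the wrapped model never reads those attributes after it has been constructed, so they don't affect the actual model. The grid search therefore keeps setting parameters that don't matter, and the model that actually gets fitted is the same every time. (Similarly, get_params returns the attributes of the RandomForest wrapper, which are not the same as the attributes of the RandomForestRegressor.)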
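To make the diagnosis concrete, here is a small sketch that reproduces the behaviour outside of a grid search. The class name `Wrapper` is made up for illustration and only keeps two of the hyperparameters; the set_params logic is the same as in the question.

```python
from sklearn.ensemble import RandomForestRegressor


class Wrapper:
    """Stripped-down, hypothetical stand-in for the RandomForest class in the question."""

    def __init__(self, n_estimators=100, max_depth=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        # The inner model copies the values once, at construction time.
        self.model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=777
        )

    def set_params(self, **parameters):
        # Same logic as in the question: only the wrapper's attributes change.
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


wrapper = Wrapper().set_params(max_depth=1, n_estimators=10)

print(wrapper.max_depth, wrapper.n_estimators)              # 1 10
print(wrapper.model.max_depth, wrapper.model.n_estimators)  # None 100
```

The wrapper reports the new values, but the model that fit would train still has the defaults, which matches the identical scores and fit times in the output above.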

You should be able to fix most of this by having set_params call self.model.set_params (and having get_params read self.model.<parameter_name> instead of self.<parameter_name>).
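A minimal sketch of that fix, under a few assumptions of mine: the `PARAM_NAMES` tuple is a helper I introduced, the data comes from `make_regression` purely for illustration, the grid is smaller than in the question to keep it quick, and the custom score method is left out because the search uses `scoring=` anyway. I also list RegressorMixin before BaseEstimator, which is the order scikit-learn recommends.

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameters exposed by the wrapper (same names as in the question).
PARAM_NAMES = ("n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "max_features")


class RandomForest(RegressorMixin, BaseEstimator):
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            max_features=max_features, random_state=777)

    def get_params(self, deep=True):
        # Report the values the wrapped model will actually use.
        return {name: getattr(self.model, name) for name in PARAM_NAMES}

    def set_params(self, **parameters):
        # Forward the new values to the wrapped model.
        self.model.set_params(**parameters)
        return self

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

grid_search = GridSearchCV(
    estimator=RandomForest(),
    param_grid={"max_depth": [1, 6], "n_estimators": [5, 20]},
    scoring="neg_root_mean_squared_error",
    cv=3,
).fit(X, y)

scores = grid_search.cv_results_["mean_test_score"]
print(scores)  # one mean score per candidate; they should no longer all be equal
```

With the parameters forwarded to self.model, each candidate in the grid trains a differently configured forest, so the per-candidate scores differ.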

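I think there may be another problem, though I'm not sure how it plays out in your example: you instantiate the model attribute from self.<parameter_name> inside __init__, so the model is built exactly once from whatever values are set at that point and never sees later changes to those attributes.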
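One possible way around this, which is my own suggestion rather than something the answer spells out, is to follow the developer guide's convention that __init__ only stores its arguments, and to build the wrapped model inside fit. The inherited get_params and set_params then work without being overridden. The attribute name `model_` and exposing `random_state` as a constructor argument are assumptions made for this sketch.

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


class RandomForest(RegressorMixin, BaseEstimator):
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None, random_state=777):
        # Only store the arguments; no other logic in __init__.
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        # Build the wrapped model here, from the current parameter values,
        # so anything changed through set_params is picked up.
        self.model_ = RandomForestRegressor(**self.get_params())
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Synthetic data, only for illustration.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

forest = RandomForest().set_params(max_depth=1, n_estimators=10).fit(X, y)
print(forest.model_.max_depth, forest.model_.n_estimators)  # 1 10
```

Because every constructor argument here is also a valid RandomForestRegressor argument, `self.get_params()` can be passed straight through; a wrapper with extra arguments of its own would need to filter them first.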

python

machine-learning

scikit-learn

grid-search

gridsearchcv