How to create a for loop with checking appended models

I have a list of models that I iterate over in a for loop to get their performance. I have added catboost to my list of models, but when I try to add its best estimator to a dictionary it gives me an error that none of the other models give (TypeError: unhashable type: 'CatBoostRegressor'). Searching Google, I can't see a clear way to fix this error, so instead I have been trying to add an if statement to my for loop that skips putting the best estimator into the dictionary when the model is catboost.

An example of the code I am running is this:

# seed, X, Y, X_train, Y_train, X_test and df3 are defined earlier in my script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shap
from sklearn import model_selection
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from skopt import BayesSearchCV

lgbm = LGBMRegressor(random_state=seed)
lgbm_params = {
    "max_depth": (1, 4),
    "learning_rate": (0.01, 0.2, "log-uniform"),
    "n_estimators": (10, 50),
    "reg_alpha": (1, 10, "log-uniform"),
    "reg_lambda": (1, 10, "log-uniform"),
}

catboost = CatBoostRegressor(random_seed=seed, verbose=False)
cat_params = {
     "iterations": (10, 50),
     'learning_rate': (0.01, 0.2, 'log-uniform'), 
     'depth':  (1, 4), 
}

inner_cv = KFold(n_splits=2, random_state=seed)
outer_cv = KFold(n_splits=2, random_state=seed)

models = []

models.append(("CB", BayesSearchCV(catboost, cat_params, cv=inner_cv, iid=False, n_jobs=1)))
models.append(("LGBM", BayesSearchCV(lgbm, lgbm_params, cv=inner_cv, iid=False, n_jobs=1)))


results = []
names = []
medians = []
scoring = ['r2', 'neg_mean_squared_error', 'max_error', 'neg_mean_absolute_error',
          'explained_variance','neg_root_mean_squared_error',
           'neg_median_absolute_error'] 

models_dictionary_r2 = {}
models_dictionary_mse = {}

for name, model in models:

    #run nested cross-validation

    nested_cv_results = model_selection.cross_validate(model, X, Y, cv=outer_cv, scoring=scoring, error_score="raise")
    nested_cv_results2 = model_selection.cross_val_score(model, X, Y, cv=outer_cv, scoring='r2', error_score="raise")
    results.append(nested_cv_results2)
    names.append(name)
    medians.append(np.median(nested_cv_results['test_r2']))
    print(name, 'Nested CV results for all scores:', '\n', nested_cv_results, '\n')
    print(name, 'r2 Nested CV Median', np.median(nested_cv_results['test_r2']))
    print(name, 'MSE Nested CV Median', np.median(nested_cv_results['test_neg_mean_squared_error'] ))

    #view best tuned model

    model.fit(X_train, Y_train)
    print("Best Parameters: \n{}\n".format(model.best_params_))
    y_pred_train = model.best_estimator_.predict(X_train)
    y_pred = model.best_estimator_.predict(X_test)
 
    #view shap interpretation of best tuned model

    X_importance = pd.DataFrame(data=X_test, columns=df3.columns)
    explainer = shap.TreeExplainer(model.best_estimator_)
    shap_values = explainer.shap_values(X_importance)
    print(name,'ALL FEATURES Ranked SHAP Importance:', X.columns[np.argsort(np.abs(shap_values).mean(0))[::-1]])
    fig, ax = plt.subplots()
    shap.summary_plot(shap_values, X_importance)
    fig.savefig("shap_summary" + name +".svg", format='svg', dpi=1200, bbox_inches = "tight")

    #add model's best estimator's best metrics to a dictionary, but ignore this for catboost 

    if model is models[0]: 
        print('catboost best estimator not compatible with entering a dictionary')
    else:
        models_dictionary_r2[model.best_estimator_] = np.median(nested_cv_results['test_r2'])
        models_dictionary_mse[model.best_estimator_] = np.median(nested_cv_results['test_neg_mean_squared_error'])

    

This is the if statement at the end that I'm trying to get working, but I have no experience using conditionals in Python. At the moment it runs and still sends the catboost model on to try to put its results into the dictionary, and I get the same TypeError: unhashable type: 'CatBoostRegressor' - is there a way to code 'if the model is catboost then move on to test the next model, else store the best estimator results in the dictionaries'?
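In pseudocode, what I have in mind is something like the sketch below, checking the "CB" label from the (name, model) tuples instead of the estimator object, though I'm not sure comparing by name is the right approach:

    if name == "CB":
        # skip the dictionary step for the catboost entry
        print('catboost best estimator not compatible with entering a dictionary')
    else:
        models_dictionary_r2[model.best_estimator_] = np.median(nested_cv_results['test_r2'])
        models_dictionary_mse[model.best_estimator_] = np.median(nested_cv_results['test_neg_mean_squared_error'])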

Unfortunately I can't share my data, but it's just 8 features of continuous variables, and the regression models' scores fall between 0 and 1.

Edit: I'm doing this to get the best estimator of the top-performing model, so that I can then fit that specific/tuned model to new data.

I take it out of the dictionary separately to fit it to new data, like this:

top_model = max(models_dictionary_r2, key=models_dictionary_r2.get)

Based on the answer so far, when I run it I get a list that looks like this:

[(<catboost.core.CatBoostRegressor at 0x7f8d50860400>, 0.8110325480633154),
 (LGBMRegressor(learning_rate=0.14567200981008144, max_depth=3, n_estimators=50,
                random_state=0, reg_alpha=1, reg_lambda=1),
  0.7632660705322947)]

Catboost has the best median r2 in this list, but I'm not sure whether the catboost entry is in the right format for its best estimator details to be fitted to new data? I tried:

top_model = models_list_predr2[0]
top_model.fit(X_train, Y_train)
AttributeError: 'tuple' object has no attribute 'fit'

How can I extract the best_estimator_ of the top-performing model from this list and make sure it works for catboost?

I'm inexperienced with Python; trying max(models_list_predr2) on the list above also raises the error TypeError: '>' not supported between instances of 'LGBMRegressor' and 'CatBoostRegressor'

This error occurs because any dictionary key must be of a hashable type, which means it should implement __hash__() for hashing and __eq__() for comparison.

Since CatBoostRegressor does not implement these methods, you get the exception when you try to add a CatBoostRegressor as a dictionary key.
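As a quick standalone illustration (a minimal sketch, assuming the same catboost version that produced your error):

from catboost import CatBoostRegressor

cat = CatBoostRegressor(verbose=False)
scores = {}
try:
    # a dictionary key must be hashable; the estimator instance is not
    scores[cat] = 0.81
except TypeError as e:
    print(e)  # unhashable type: 'CatBoostRegressor'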

For models_dictionary_r2 and models_dictionary_mse, I suggest you use lists instead of dictionaries:

models_list_r2 = []
models_list_mse = []

Then you can add values to these lists like this:

best_estimator = model.best_estimator_
median_r2 = np.median(nested_cv_results['test_r2'])
models_list_r2.append((best_estimator, median_r2))

median_mse = np.median(nested_cv_results['test_neg_mean_squared_error'])
models_list_mse.append((model.best_estimator_, median_mse))

To select the model with the highest R-squared, you can add the following code:

best_model, best_r2 = sorted(models_list_r2, key=lambda x: x[1], reverse=True)[0]
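best_model is then the fitted best_estimator_ itself rather than an (estimator, score) tuple, so it can be refit and used for prediction directly, whether it is the CatBoost or the LightGBM model. A minimal sketch, assuming your existing X_train, Y_train and X_test:

# best_model is the underlying estimator (CatBoostRegressor or LGBMRegressor);
# both expose the scikit-learn fit/predict interface
best_model.fit(X_train, Y_train)
new_predictions = best_model.predict(X_test)
print("Top model by median r2:", type(best_model).__name__, best_r2)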