How to determine the best baseline model to perform hyperparameter tuning on in scikit-learn?

Question

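I am working on a dataset and trying out different classification algorithms to see which one performs best as a baseline model. The code is as follows: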

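# Trying out different classifiers and selecting the best

from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# NOTE: X (a DataFrame), y and preprocessor are defined earlier and not shown here

## Create list of classifiers we're going to loop through
classifiers = [
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

classifier_names = [
    'kNN',
    'SVC',
    'DecisionTree',
    'RandomForest',
    'AdaBoost',
    'GradientBoosting'
]

model_scores = []

## Looping through the classifiers
for classifier, name in zip(classifiers, classifier_names):
    # Same preprocessing for every model; only the final estimator changes
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        # k equals the number of input columns
        ('selector', SelectKBest(k=len(X.columns))),
        ('classifier', classifier)])
    # Mean accuracy over 5 cross-validation folds, all hyperparameters at their defaults
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    model_scores.append(score)
    print("Model score for {}: {}".format(name, score))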

The output is:

Model score for kNN: 0.7472524440239673
Model score for SVC: 0.7896621728161464
Model score for DecisionTree: 0.7302148734267939
Model score for RandomForest: 0.779058799919727
Model score for AdaBoost: 0.7949635904933918
Model score for GradientBoosting: 0.7930712637252372

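It turns out that the best model is AdaBoostClassifier(). I would normally take the best baseline model and run GridSearchCV on it to further improve on its baseline performance.

But what if, hypothetically, the model that performs best as a baseline (AdaBoost in this case) improves by only 1% through hyperparameter tuning, while a model that initially performed worse (e.g. SVC()) has more "potential" to improve through tuning (e.g. it would gain 4%) and ends up being the better model after tuning?

Is there a way to know this "potential" beforehand, without running GridSearch on all the classifiers?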

Answer 1

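Yes, there are methods such as univariate, bivariate, and multivariate analysis for examining the data, after which you can decide which model to start with as a baseline.

You can also use scikit-learn's guide to choosing the right estimator: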

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Answer 2

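No, there is no way to know with 100% certainty, before tuning hyperparameters, which classifier will ultimately perform best on a given problem. In practice, however, Kaggle competitions on tabular classification problems (as opposed to text- or image-based ones) have shown that gradient-boosted decision-tree models (such as XGBoost or LightGBM) work best in nearly every case. Given this, it's likely that GradientBoosting will perform better under hyperparameter tuning, since it belongs to the same family of algorithms. (Note that scikit-learn's GradientBoostingClassifier is a separate implementation and is not built on LightGBM; the scikit-learn estimator inspired by LightGBM is HistGradientBoostingClassifier.)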

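What your code above does is simply use the default values for all hyperparameters. For algorithms that are more sensitive to hyperparameter tuning, the default score is not necessarily indicative of the final (fine-tuned) performance, as you suggest.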
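One inexpensive way to get a feel for how much each model moves under tuning, short of a full grid search, is to give every candidate the same small RandomizedSearchCV budget and compare the result with its default score. The following is a minimal sketch of that idea, not a definitive recipe: the data is synthetic, the StandardScaler step stands in for the asker's preprocessor, and the search spaces are hypothetical values chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)

# Hypothetical, deliberately small search spaces for each candidate.
candidates = {
    "SVC": (SVC(), {
        "classifier__C": [0.1, 1, 10, 100],
        "classifier__gamma": ["scale", 0.01, 0.1]}),
    "AdaBoost": (AdaBoostClassifier(random_state=0), {
        "classifier__n_estimators": [25, 50, 100],
        "classifier__learning_rate": [0.1, 0.5, 1.0]}),
    "GradientBoosting": (GradientBoostingClassifier(random_state=0), {
        "classifier__n_estimators": [50, 100],
        "classifier__max_depth": [2, 3],
        "classifier__learning_rate": [0.05, 0.1]}),
}

results = {}
for name, (clf, params) in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    # Baseline: mean CV accuracy with default hyperparameters.
    default_score = cross_val_score(pipe, X, y, cv=3,
                                    scoring="accuracy").mean()
    # Small fixed budget: 5 sampled configurations per model.
    search = RandomizedSearchCV(pipe, params, n_iter=5, cv=3,
                                scoring="accuracy", random_state=0)
    search.fit(X, y)
    results[name] = (default_score, search.best_score_)
    print("{}: default={:.3f}, tuned={:.3f}".format(
        name, default_score, search.best_score_))
```

The gap between the two numbers is only a rough hint. With so few sampled configurations, the tuned score can even fall below the default one, and best_score_ is optimistic because it is selected on the same folds it is measured on.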
