Including Scaling and PCA as parameters of GridSearchCV

Question

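I want to run a logistic regression using GridSearchCV, but I want to compare the performance with and without Scaling and PCA, so I don't want to apply them in every case.

Basically, I want to include PCA and scaling as "parameters" of GridSearchCV.

I know I can build a pipeline like this: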

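from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mnl = LogisticRegression(fit_intercept=True, multi_class="multinomial")

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mnl', mnl)])

params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'mnl__max_iter':[500,1000,2000,3000]}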

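The problem is that, in this case, the scaling is applied in every fold, right? Is there a way to "include" it in the grid search?

Edit:

I just read about a similar case. Although it resembles what I want, it is not quite the same, because there the Scaler is applied to the best estimator coming out of the GridSearch.

What I want to do is, say, with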

params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs']}

run the regression with Scaler + newton-cg, No Scaler + newton-cg, Scaler + lbfgs, and No Scaler + lbfgs.

Answer 1

You can set the with_mean and with_std parameters of StandardScaler() to False, which means no standardization is performed. In GridSearchCV, the param_grid parameter can then be set to

param_grid = [{'scale__with_mean': [False],
               'scale__with_std': [False],
               'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'mnl__max_iter':[500,1000,2000,3000]
              },
              {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'mnl__max_iter':[500,1000,2000,3000]}
]

The first dictionary in the list is then "No Scaler + mnl", and the second is "Scaler + mnl".

References:
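To see which combinations such a list of dictionaries expands to, it can be passed to scikit-learn's ParameterGrid. The sketch below uses a shortened grid (two solvers, no max_iter) purely for illustration, and also checks on a small made-up array that a StandardScaler with both flags set to False leaves the data unchanged.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Shortened grid, chosen only for illustration.
param_grid = [
    {'scale__with_mean': [False],
     'scale__with_std': [False],
     'mnl__solver': ['newton-cg', 'lbfgs']},   # "No Scaler" settings
    {'mnl__solver': ['newton-cg', 'lbfgs']},   # scaler keeps its defaults
]

# Each dictionary is expanded separately; the results are concatenated.
combos = list(ParameterGrid(param_grid))
for combo in combos:
    print(combo)

# With both flags off, StandardScaler neither centers nor rescales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up data
X_out = StandardScaler(with_mean=False, with_std=False).fit_transform(X)
unchanged = np.allclose(X, X_out)
```

Two of the four combinations carry the scale__with_mean / scale__with_std overrides, and the other two leave the scaler at its defaults.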

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

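Edit: I think it gets more complicated if you also want to turn PCA on and off. Perhaps you need to define a custom PCA class derived from the original PCA, with an additional boolean parameter that determines whether PCA should be performed.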

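from sklearn.decomposition import PCA

class MYPCA(PCA):
    # scikit-learn needs explicit keyword arguments in __init__ (no *args),
    # otherwise get_params()/clone() fail inside GridSearchCV.
    def __init__(self, PCA_turn_on=True, n_components=None):
        super().__init__(n_components=n_components)
        self.PCA_turn_on = PCA_turn_on

    def fit(self, X, y=None):
        if self.PCA_turn_on:
            return super().fit(X, y)
        return self  # PCA disabled: nothing to learn

    def transform(self, X):
        # PCA disabled: pass the data through unchanged
        return super().transform(X) if self.PCA_turn_on else X

    def fit_transform(self, X, y=None):
        return super().fit_transform(X, y) if self.PCA_turn_on else X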

Answer 2

From the Pipeline documentation:

A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.

For example:

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mnl', mnl),
])

params = {
    'scale': ['passthrough', StandardScaler()],
    'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'mnl__max_iter': [500, 1000, 2000, 3000],
}
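The same mechanism can be extended to PCA, which the question also asks about. The following is a minimal sketch rather than a definitive setup: the iris dataset, n_components=2, cv=3, and the reduced solver list are all assumptions made for illustration, and multi_class is left at its default so the example does not depend on a particular scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # example data, assumed for illustration

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('mnl', LogisticRegression(max_iter=2000)),
])

# Each preprocessing step is either a real transformer or 'passthrough',
# so the grid covers every on/off combination of scaling and PCA.
params = {
    'scale': [StandardScaler(), 'passthrough'],
    'pca': [PCA(n_components=2), 'passthrough'],
    'mnl__solver': ['newton-cg', 'lbfgs'],
}

grid = GridSearchCV(pipe, params, cv=3)
grid.fit(X, y)

# One row per combination: 2 (scale) x 2 (pca) x 2 (solver) candidates.
results = grid.cv_results_
for p, score in zip(results['params'], results['mean_test_score']):
    print(p, round(score, 3))
```

Because the transformers sit inside the pipeline, each candidate refits them on the training part of every fold, and grid.best_params_ reports whether scaling and PCA were kept or passed through in the best-scoring candidate.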


python

pipeline

regression

scikit-learn

grid-search