Invalid parameter clf for estimator Pipeline in sklearn

Question

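Could anyone check what is wrong with the following code? Did I make a mistake in any step of building the model? I have already added the 'clf__' prefix (with two underscores) to the parameter names.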

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline

clf = RandomForestClassifier()
pca = PCA()
pca_clf = make_pipeline(pca, clf)

kfold = KFold(n_splits=10, random_state=22)

parameters = {'clf__n_estimators': [4, 6, 9],
              'clf__max_features': ['log2', 'sqrt', 'auto'],
              'clf__criterion': ['entropy', 'gini'],
              'clf__max_depth': [2, 3, 5, 10],
              'clf__min_samples_split': [2, 3, 5],
              'clf__min_samples_leaf': [1, 5, 8]}

grid_RF = GridSearchCV(pca_clf, param_grid=parameters,
                       scoring='accuracy', cv=kfold)
grid_RF = grid_RF.fit(X_train, y_train)
clf = grid_RF.best_estimator_
clf.fit(X_train, y_train)
grid_RF.best_score_

cv_result = cross_val_score(clf, X_train, y_train, cv=kfold,
                            scoring="accuracy")

cv_result.mean()

Answer 1

Your assumption about how make_pipeline names its steps is wrong. From the documentation:

This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.

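This means that when you pass a PCA object, its step is named 'pca' (lowercase), and when you pass a RandomForestClassifier object, its step is named 'randomforestclassifier', not 'clf' as you expected.

So the parameter grid you created is invalid: its keys use the clf__ prefix, and no step named clf exists in the pipeline.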
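
To see the automatically assigned names for yourself, you can inspect the pipeline directly. The following is a minimal sketch; the variable names pipe, step_names and valid_keys are only illustrative and do not come from the question.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(PCA(), RandomForestClassifier())

# make_pipeline derives each step name from the lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Every key accepted in a parameter grid appears in get_params(),
# in the form "<step name>__<parameter name>".
valid_keys = sorted(pipe.get_params().keys())
print([key for key in valid_keys if key.endswith("__n_estimators")])
```

Any grid key that is missing from get_params() is rejected with the "Invalid parameter" error from the title.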

Solution 1:

Replace this line:

pca_clf = make_pipeline(pca, clf)

with

pca_clf = Pipeline([('pca', pca), ('clf', clf)])
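
As a rough end-to-end sketch of this first approach, the example below runs the grid search with explicitly named steps. It rests on several assumptions that are not in the question: the data is synthetic (make_classification), the grid is shortened to keep the run fast, and shuffle=True is added to KFold because recent scikit-learn versions reject random_state without it.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the asker's X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# Explicit step names, so the 'clf__' prefix now refers to a real step.
pca_clf = Pipeline([("pca", PCA()), ("clf", RandomForestClassifier(random_state=0))])

# Shortened grid (assumption) to keep the example quick.
parameters = {
    "clf__n_estimators": [4, 6],
    "clf__max_depth": [2, 5],
}

kfold = KFold(n_splits=3, shuffle=True, random_state=22)
grid_RF = GridSearchCV(pca_clf, param_grid=parameters, scoring="accuracy", cv=kfold)
grid_RF.fit(X_train, y_train)

print(grid_RF.best_params_)
print(grid_RF.best_score_)
```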

Solution 2:

If you would rather not change the line pca_clf = make_pipeline(pca, clf), replace every occurrence of clf in parameters with 'randomforestclassifier', like this:

parameters = {'randomforestclassifier__n_estimators': [4, 6, 9], 
              'randomforestclassifier__max_features': ['log2', 'sqrt','auto'],
              'randomforestclassifier__criterion': ['entropy', 'gini'], 
              'randomforestclassifier__max_depth': [2, 3, 5, 10], 
              'randomforestclassifier__min_samples_split': [2, 3, 5],
              'randomforestclassifier__min_samples_leaf': [1,5,8] }
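
If typing the long prefix by hand feels error-prone, one possible variation is to read the step name from the pipeline and build the keys from it. This is only a sketch; rf_grid and final_step_name are hypothetical names introduced for illustration, and the grid is shortened.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pca_clf = make_pipeline(PCA(), RandomForestClassifier())

# Grid written with bare RandomForestClassifier parameter names (illustrative).
rf_grid = {
    "n_estimators": [4, 6, 9],
    "max_depth": [2, 3, 5, 10],
}

# Name that make_pipeline assigned to the final step.
final_step_name = pca_clf.steps[-1][0]

# Prefix every parameter with "<step name>__".
parameters = {
    f"{final_step_name}__{name}": values for name, values in rf_grid.items()
}

# Sanity check: keys the pipeline does not recognise would end up here.
unknown = set(parameters) - set(pca_clf.get_params())
print(sorted(parameters))
print(unknown)
```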

Side note: there is no need to do this in your code:

clf = grid_RF.best_estimator_
clf.fit(X_train, y_train)

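With the default refit=True, GridSearchCV has already refit best_estimator_ on the whole training data using the best parameters found, so calling clf.fit() again is redundant.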
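
A small sketch of this point, assuming synthetic data and a shortened grid (neither is from the question): the estimator returned by best_estimator_ can predict right away, without another fit call.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the asker's training data (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(PCA(), RandomForestClassifier(random_state=0))
grid_RF = GridSearchCV(
    pipe,
    param_grid={"randomforestclassifier__n_estimators": [4, 6]},
    scoring="accuracy",
    cv=3,
)
grid_RF.fit(X_train, y_train)

# refit defaults to True, so best_estimator_ is already fitted on all of X_train.
best = grid_RF.best_estimator_
predictions = best.predict(X_train[:5])
print(predictions.shape)
```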


