How to find the best parameters for different "steps" in a pipeline?
I have the following pipeline, which combines preprocessing, feature selection, and an estimator:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# X (a pandas DataFrame) and y are assumed to be loaded already.

# Selecting categorical and numeric features
numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

# Create preprocessing pipelines for each datatype
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder()),
    ('scaler', StandardScaler())])

# Putting the preprocessing steps together
preprocessor = ColumnTransformer([
    ('numerical', numerical_transformer, numerical_ix),
    ('categorical', categorical_transformer, categorical_ix)],
    remainder='passthrough')

# Create example pipeline with kNN
example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(k=len(X.columns))),  # keep the same amount of columns for now
    ('classifier', KNeighborsClassifier())
])
cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
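I wrote the following code to "try" different values of k for SelectKBest and plot the results.
But how can I search for the best value of k in the kNN classifier at the same time? I don't necessarily need to plot it, I just want to find the best value. My guess is GridSearchCV, but I don't know how to apply it to the different steps in a pipeline.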
import matplotlib.pyplot as plt

k_range = list(range(1, len(X.columns)))  # 1 to 18 here; range() excludes len(X.columns)
k_scores = []
for k in k_range:
    example_pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('selector', SelectKBest(k=k)),  # keep only the k best features in this iteration
        ('classifier', KNeighborsClassifier())])
    score = cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
    k_scores.append(score)

plt.plot(k_range, k_scores)
plt.xlabel('Value of k in SelectKBest')
plt.xticks(k_range, rotation=20)
plt.ylabel('Cross-Validated Accuracy')
plt.show()
For those interested, the output is:
[Plot: cross-validated accuracy against the value of k in SelectKBest]
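You are looking for the best value of n_neighbors for KNeighborsClassifier.
Your guess of using GridSearchCV is correct. To see how it works together with pipelines, have a look at the documentation of Pipeline: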
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’
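If you are unsure which names are valid, you can ask the pipeline itself. The snippet below is a minimal sketch that uses a simplified stand-in for the pipeline in the question (a plain StandardScaler instead of the ColumnTransformer, which is my assumption to keep it self-contained). Every key returned by get_params() can be used in set_params() or in a parameter grid.

```python
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the question's pipeline (assumption: no ColumnTransformer).
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest()),
    ('classifier', KNeighborsClassifier()),
])

# Keys containing '__' address a parameter of a named step: <step>__<parameter>.
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# set_params understands the same syntax and updates the steps in place.
pipe.set_params(selector__k=2, classifier__n_neighbors=7)
print(pipe.named_steps['classifier'].n_neighbors)
```

The same names are what GridSearchCV expects as keys of param_grid.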
In your case:
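from sklearn.model_selection import GridSearchCV

example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest()),  # k is set by the grid search
    ('classifier', KNeighborsClassifier())
])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    "selector__k": [5, 10, 15],
    "classifier__n_neighbors": [3, 5, 10]
}

# Every combination in the grid is cross-validated.
gs = GridSearchCV(example_pipe, param_grid=param_grid)
gs.fit(X, y)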
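The same naming scheme also reaches into nested steps by chaining the names, for example the imputer inside the numerical pipeline inside the ColumnTransformer. Here is a rough end-to-end sketch. The data is synthetic and purely hypothetical (I don't have your X and y), and the grid values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical stand-in data: six numeric columns plus one categorical column.
features, y = make_classification(n_samples=200, n_features=6,
                                  n_informative=4, random_state=0)
X = pd.DataFrame(features, columns=[f'num_{i}' for i in range(6)])
rng = np.random.default_rng(0)
X['cat_0'] = rng.choice(['a', 'b', 'c'], size=len(X))

numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

preprocessor = ColumnTransformer([
    ('numerical', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())]), numerical_ix),
    ('categorical', Pipeline(steps=[
        ('encoder', OrdinalEncoder()),
        ('scaler', StandardScaler())]), categorical_ix)],
    remainder='passthrough')

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest()),
    ('classifier', KNeighborsClassifier()),
])

param_grid = {
    # Nested step: pipeline step -> transformer name -> inner step -> parameter.
    'preprocessor__numerical__imputer__strategy': ['mean', 'median'],
    'selector__k': [3, 5, 7],  # must not exceed the 7 preprocessed columns
    'classifier__n_neighbors': [3, 5, 10],
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)
print(gs.best_params_)
```

Keep in mind that the grid grows multiplicatively: this one has 2 * 3 * 3 = 18 combinations, each fitted once per fold.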
Then retrieve the best parameters with best_params_:
best_k = gs.best_params_['selector__k']
best_n_neighbors = gs.best_params_['classifier__n_neighbors']
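Besides best_params_, the fitted search object also exposes best_score_, best_estimator_ (the pipeline refitted on all data with the winning parameters, since refit=True is the default), and cv_results_ with the score of every combination. A small sketch on hypothetical synthetic data, again with a simplified pipeline that is my assumption rather than your exact setup:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric-only data standing in for the asker's X and y.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=4, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest()),
    ('classifier', KNeighborsClassifier()),
])
param_grid = {
    'selector__k': [2, 4, 8],
    'classifier__n_neighbors': [3, 5, 10],
}
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)

# Best combination and its mean cross-validated accuracy.
print(gs.best_params_, gs.best_score_)

# One row per combination; pivot into a k-by-n_neighbors table of mean scores.
results = pd.DataFrame(gs.cv_results_)
scores = results.pivot(index='param_selector__k',
                       columns='param_classifier__n_neighbors',
                       values='mean_test_score')
print(scores)

# The refitted pipeline can be used directly for prediction.
best_model = gs.best_estimator_
predictions = best_model.predict(X[:5])
```

The pivoted table is also a convenient input if you still want a plot, this time over both parameters.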