How to use GridSearchCV for comparing multiple models along with pipeline and hyper-parameter tuning in python
I am using two estimators, Random Forest and SVM:
random_forest_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('random_forest', RandomForestClassifier())
])
svm_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('svm', LinearSVC())
])
I want to vectorize the data first and then apply the estimator; I was following this tutorial online. I then defined the hyperparameters as follows:
parameters = [
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'random_forest__min_samples_split': [50, 100, 250, 500]
    },
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'svm__C': [1, 3, 5]
    }
]
and passed them to GridSearchCV:
pipelines=[random_forest_pipeline,svm_pipeline]
grid_search=GridSearchCV(pipelines,param_grid=parameters,cv=3,n_jobs=-1)
grid_search.fit(x_train,y_train)
However, when I run the code I get an error:
TypeError: estimator should be an estimator implementing 'fit' method
I don't know why this error occurs.
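The problem is that pipelines=[random_forest_pipeline,svm_pipeline] is a list, and a list has no fit method.
Even if you could make it work this way, GridSearchCV applies every dictionary in param_grid to the same estimator, so at some point 'random_forest__min_samples_split':[50,100,250,500] would be passed to svm_pipeline (or 'svm__C' to random_forest_pipeline), and that raises an error such as:
ValueError: Invalid parameter svm for estimator Pipeline
You cannot mix the two pipelines this way, because at some point you would be asking for svm_pipeline to be evaluated with values of random_forest__min_samples_split, which is invalid.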
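Both failure modes described above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; it rebuilds the two pipelines from the question, and the flag names (list_has_fit and so on) are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

random_forest_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('random_forest', RandomForestClassifier()),
])
svm_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('svm', LinearSVC()),
])
pipelines = [random_forest_pipeline, svm_pipeline]

# A plain Python list is not an estimator: it has no fit method,
# while each individual pipeline does.
list_has_fit = hasattr(pipelines, 'fit')
pipeline_has_fit = hasattr(svm_pipeline, 'fit')

# A pipeline only accepts parameter names derived from its own steps.
svm_param_names = set(svm_pipeline.get_params())
rf_key_valid_for_svm = 'random_forest__min_samples_split' in svm_param_names
svm_key_valid_for_svm = 'svm__C' in svm_param_names

# Setting a parameter that belongs to the other pipeline raises ValueError.
try:
    svm_pipeline.set_params(random_forest__min_samples_split=50)
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(list_has_fit, pipeline_has_fit)
print(rf_key_valid_for_svm, svm_key_valid_for_svm, raised_value_error)
```

Calling get_params() on a pipeline is a quick way to see which keys are legal in its param_grid before starting a long search.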
Solution: fit one GridSearchCV object for the random forest model and another GridSearchCV object for the SVC model.
pipelines=[random_forest_pipeline,svm_pipeline]
grid_search_1=GridSearchCV(pipelines[0],param_grid=parameters[0],cv=3,n_jobs=-1)
grid_search_1.fit(X,y)
grid_search_2=GridSearchCV(pipelines[1],param_grid=parameters[1],cv=3,n_jobs=-1)
grid_search_2.fit(X,y)
Full code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

random_forest_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('random_forest', RandomForestClassifier())
])
svm_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('svm', LinearSVC())
])
parameters = [
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'random_forest__min_samples_split': [50, 100, 250, 500]
    },
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'svm__C': [1, 3, 5]
    }
]
pipelines = [random_forest_pipeline, svm_pipeline]
# X holds the raw text documents and y their labels (x_train, y_train in the question)
# gridsearch only for the Random Forest model
grid_search_1 = GridSearchCV(pipelines[0], param_grid=parameters[0], cv=3, n_jobs=-1)
grid_search_1.fit(X, y)
# gridsearch only for the SVC model
grid_search_2 = GridSearchCV(pipelines[1], param_grid=parameters[1], cv=3, n_jobs=-1)
grid_search_2.fit(X, y)
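Once both searches are fitted, the two models can be compared through best_score_, the mean cross-validated score of the best parameter combination in each search. The sketch below is one possible way to do that, not part of the original answer. The toy corpus, the small grid values and the searches dictionary are assumptions made only so that it runs quickly and on its own. Because both searches use cv=3 on the same data, they are scored on the same unshuffled stratified folds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only here to make the sketch runnable.
texts = [
    'great movie loved acting', 'wonderful film great story',
    'loved wonderful acting', 'great story loved film',
    'fantastic movie wonderful cast', 'loved fantastic story',
    'terrible movie hated acting', 'awful film boring story',
    'hated awful acting', 'boring story hated film',
    'terrible movie awful cast', 'hated boring story',
]
labels = [1] * 6 + [0] * 6

# One GridSearchCV per model, each with its own parameter grid.
# The grid values are scaled down to suit the tiny corpus.
searches = {
    'random_forest': GridSearchCV(
        Pipeline([
            ('vectorizer', CountVectorizer(stop_words='english')),
            ('random_forest', RandomForestClassifier(n_estimators=10, random_state=0)),
        ]),
        param_grid={
            'vectorizer__max_features': [5, 10],
            'random_forest__min_samples_split': [2, 4],
        },
        cv=3,
    ),
    'svm': GridSearchCV(
        Pipeline([
            ('vectorizer', CountVectorizer(stop_words='english')),
            ('svm', LinearSVC()),
        ]),
        param_grid={
            'vectorizer__max_features': [5, 10],
            'svm__C': [1, 3],
        },
        cv=3,
    ),
}

for name, search in searches.items():
    search.fit(texts, labels)

# Pick the model whose best parameter combination scored highest.
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_search = searches[best_name]
print(best_name, best_search.best_score_, best_search.best_params_)
```

By default best_search.best_estimator_ is already refitted on all of the data passed to fit, so it can be used for prediction directly.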
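Edit
If you define the models explicitly inside the param_grid list, this can be done in one search, as shown in the documentation.
Code from the documentation: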
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
print(__doc__)
pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),
    ('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)
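According to the example here, it is entirely possible to do this in a single Pipeline/GridSearchCV.
You just need to specify the scoring method explicitly, because the pipeline's final estimator is not declared up front (the last step starts as 'passthrough').
Example: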
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', 'passthrough')
])
parameters = [
    {
        'vectorizer__max_features': [500, 1000],
        'clf': [RandomForestClassifier()],
        'clf__min_samples_split': [50, 100]
    },
    {
        'vectorizer__max_features': [500, 1000],
        'clf': [LinearSVC()],
        'clf__C': [1, 3]
    }
]
grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, n_jobs=-1, scoring='accuracy')
grid_search.fit(X, y)
grid_search.best_params_
# {'clf': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
# criterion='gini', max_depth=None, max_features='auto',
# max_leaf_nodes=None, max_samples=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=100,
# min_weight_fraction_leaf=0.0, n_estimators=100,
# n_jobs=None, oob_score=False, random_state=None,
# verbose=0, warm_start=False),
# 'clf__min_samples_split': 100,
# 'vectorizer__max_features': 1000}
pd.DataFrame(grid_search.cv_results_)[['param_vectorizer__max_features',
'param_clf__min_samples_split',
'param_clf__C','mean_test_score',
'rank_test_score']]
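With a single search over both classifiers, cv_results_ mixes rows from the two models, so it can help to label each row with the classifier it used and then take the best score per model. The sketch below is one way to do this, assuming scikit-learn and pandas are installed. The toy corpus, the scaled-down grid values and the model column are assumptions added for illustration and do not come from the original answer.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only here to make the sketch runnable.
texts = [
    'great movie loved acting', 'wonderful film great story',
    'loved wonderful acting', 'great story loved film',
    'fantastic movie wonderful cast', 'loved fantastic story',
    'terrible movie hated acting', 'awful film boring story',
    'hated awful acting', 'boring story hated film',
    'terrible movie awful cast', 'hated boring story',
]
labels = [1] * 6 + [0] * 6

my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', 'passthrough'),  # placeholder, filled in by the parameter grid
])
parameters = [
    {
        'vectorizer__max_features': [5, 10],
        'clf': [RandomForestClassifier(n_estimators=10, random_state=0)],
        'clf__min_samples_split': [2, 4],
    },
    {
        'vectorizer__max_features': [5, 10],
        'clf': [LinearSVC()],
        'clf__C': [1, 3],
    },
]
grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, scoring='accuracy')
grid_search.fit(texts, labels)

results = pd.DataFrame(grid_search.cv_results_)
# Label every candidate with the class name of the classifier it used.
results['model'] = [type(p['clf']).__name__ for p in grid_search.cv_results_['params']]
# Best mean cross-validated accuracy reached by each model.
best_per_model = results.groupby('model')['mean_test_score'].max()
print(best_per_model)
```

Columns such as param_clf__C are empty for the random forest rows and param_clf__min_samples_split is empty for the LinearSVC rows, since each parameter exists in only one of the two grids.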