Using VotingClassifier with other classifiers inside a sklearn Pipeline
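I want to use a VotingClassifier inside a sklearn Pipeline, in which I define a set of classifiers.
I got some intuition from this question to build the code below, but in that question each classifier is defined in its own independent pipeline. I don't want to do it that way: I have already prepared a set of features, and regenerating them in several pipelines with different classifiers is not a good idea (it is a time-consuming process).
How can I do this?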
model = Pipeline([
    ('feat', FeatureUnion([
        ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 5), min_df=0.01, lowercase=True, tokenizer=tokenizeTfidf)),
    ])),
    ('pip1', Pipeline([('clf1', GradientBoostingClassifier(n_estimators=1000, random_state=7))])),
    ('pip2', Pipeline([('clf2', SVC())])),
    ('pip3', Pipeline([('clf3', RandomForestClassifier())])),
    ('clf', VotingClassifier(estimators=["pip1", "pip2", "pip3"]))
])
clf = model.fit(X_train, y_train)
But I get this error:
('clf', VotingClassifier(estimators=["pip1", "pip2", "pip3"])),
File "C:\Python35\lib\site-packages\imblearn\pipeline.py", line 115, in __init__
self._validate_steps()
File "C:\Python35\lib\site-packages\imblearn\pipeline.py", line 139, in _validate_steps
"(but not both) '%s' (type %s) doesn't)" % (t, type(t)))
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or sample (but not both) 'Pipeline(memory=None,
steps=[('clf1', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
presort='auto', random_state=7, subsample=1.0, verbose=0,
warm_start=False))])' (type <class 'imblearn.pipeline.Pipeline'>) doesn't)
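I assume you want to do something like this:
1) Convert the text data to tf-idf features using TfidfVectorizer.
2) Send the transformed data to 3 estimators (GradientBoostingClassifier, SVC, RandomForestClassifier), then use voting to get the prediction.
If that is the case, this is what you need. Every step of a Pipeline except the last must implement transform, so classifiers cannot be intermediate steps, and VotingClassifier expects a list of (name, estimator) tuples rather than the names of other pipeline steps. Pass the classifiers directly to VotingClassifier: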
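from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# The traceback in the question shows imblearn.pipeline.Pipeline; sklearn's Pipeline is imported here instead.
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# tokenizeTfidf is the asker's own function, defined elsewhere.
model = Pipeline([
    ('feat', FeatureUnion([
        ('tfidf', TfidfVectorizer(analyzer='char',
                                  ngram_range=(3, 5),
                                  min_df=0.01,
                                  lowercase=True,
                                  tokenizer=tokenizeTfidf)),
    ])),
    # Each entry is a (name, estimator) tuple, not the name of another step.
    ('clf', VotingClassifier(estimators=[
        ("pip1", GradientBoostingClassifier(n_estimators=1000, random_state=7)),
        ("pip2", SVC()),
        ("pip3", RandomForestClassifier()),
    ])),
])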
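To check that this layout runs end to end, here is a minimal, self-contained sketch. The toy corpus, the labels, the estimator names, and the smaller settings (n_estimators=20, default min_df, no custom tokenizer) are assumptions made for illustration and are not part of the original setup.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: two classes of short strings.
X_train = ["good movie", "great film", "nice plot", "lovely acting",
           "bad movie", "awful film", "poor plot", "terrible acting"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

model = Pipeline([
    # The vectorizer is fitted once; all three classifiers share its output.
    ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 5))),
    ('clf', VotingClassifier(estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=7)),
        ("svc", SVC()),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=7)),
    ])),
])

model.fit(X_train, y_train)

# Majority ("hard") voting is the default, so predict returns class labels.
predictions = model.predict(["great movie", "awful plot"])

# Names of the fitted sub-estimators inside the voting step.
fitted_names = list(model.named_steps['clf'].named_estimators_.keys())
```

Because the vectorizer sits in front of the voting step, the tf-idf features are computed a single time per call to fit or predict, which addresses the concern about regenerating features for every classifier.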
Also, if you are only using a single TfidfVectorizer and not combining any other features with it, you don't even need the FeatureUnion:
model = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char',
                              ngram_range=(3, 5),
                              min_df=0.01,
                              lowercase=True,
                              tokenizer=tokenizeTfidf)),
    ('clf', VotingClassifier(estimators=[
        ("pip1", GradientBoostingClassifier(n_estimators=1000, random_state=7)),
        ("pip2", SVC()),
        ("pip3", RandomForestClassifier()),
    ])),
])
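As a quick sanity check on dropping the FeatureUnion, the sketch below compares the features produced by a vectorizer on its own with those produced by the same vectorizer wrapped in a one-element FeatureUnion. The toy corpus and the make_vectorizer helper are hypothetical, introduced only for this comparison.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus.
docs = ["good movie", "great film", "bad movie", "awful film"]


def make_vectorizer():
    """Build a fresh vectorizer so the two runs do not share fitted state."""
    return TfidfVectorizer(analyzer='char', ngram_range=(3, 5))


# Features from the vectorizer alone.
direct = make_vectorizer().fit_transform(docs)

# Features from the same vectorizer wrapped in a one-element FeatureUnion.
wrapped = FeatureUnion([('tfidf', make_vectorizer())]).fit_transform(docs)

same_shape = direct.shape == wrapped.shape
# Both results are sparse matrices; take the largest absolute difference.
max_abs_diff = abs(direct - wrapped).max()
```

With a single transformer the union only concatenates one block of columns, so the two feature matrices should match and the wrapper adds nothing but indirection.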