Is it possible to combine multiple pipelines into a single estimator in Neuraxle or sklearn to create a multi-output classifier and fit it in one go?
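I want to create a multi-output classifier. My problem is that the distribution of positive labels differs widely between outputs: for example, output 1 has 2% positive labels and output 2 has 20%. So I want to split the data sampling and model fitting for each output into separate streams (multiple sub-pipelines), where each sub-pipeline performs its own oversampling, and the hyperparameters of both the oversampler and the classifier are optimized separately for each output.
For example, say I have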
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X = ...  # placeholder: some input features array here
y = np.array([[0, 1],
              [0, 1],
              [0, 0],
              [1, 0],
              [0, 0]])  # unbalanced label distribution
y_1 = y[:, 0]  # labels for output 1
y_2 = y[:, 1]  # labels for output 2

# The same grid is searched for both outputs, but each search runs on its own label column
param_grid_shared = {'oversampler__sampling_strategy': [0.2, 0.4, 0.5], 'logit__C': [1, 0.1, 0.01]}

pipeline_output_1 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_1 = GridSearchCV(pipeline_output_1, param_grid_shared)
grid_1.fit(X, y_1)

pipeline_output_2 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_2 = GridSearchCV(pipeline_output_2, param_grid_shared)
grid_2.fit(X, y_2)
I would like to combine them to create something like
multi_pipe = Pipeline([(Something to separate X and y into multiple streams)
((pipe_1, pipeline_output_1),
(pipe_2, pipeline_output_2)), # 2 pipeline optimized separately
(Evaluate and select hyperparameters for each pipeline separately)
(Something to combine output from pipeline 1 and pipeline 2)
])
in Neuraxle or sklearn.
MultiOutputClassifier is definitely not a fit for this case, and I'm not sure where to look for a solution now.
I created an issue with the following idea:
pipe_1_with_oversampler_1 = Pipeline([
Oversampler1().assert_has_services(DataRepository), Pipeline1()])
pipe_2_with_oversampler_2 = Pipeline([
Oversampler2().assert_has_services(DataRepository), Pipeline2()])
multi_pipe = Pipeline([
DataPreprocessingStep(),
# Evaluate and select hyperparameters for each pipeline separately, but within one run, using `multi_pipe.fit(...)`:
FeatureUnion([
AutoML(pipe_1_with_oversampler_1, **automl_args_1),
AutoML(pipe_2_with_oversampler_2, **automl_args_2)
]),
# And then combine output from pipeline 1 and pipeline 2 using feature union.
# Can do preprocessing and postprocessing as well.
PostprocessingStep(),
])
For this to work, the AutoML object could be refactored into a regular step, so that it can be used in place of a step.
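For the plain sklearn side of the question, here is a minimal sketch of one way to get a single `fit` call, assuming a small hand-written wrapper is acceptable. `PerOutputClassifier` is a hypothetical name, not part of sklearn or Neuraxle: it clones one estimator per column of `y`, fits each clone on its own column, and stacks the predictions. To keep the sketch free of extra dependencies, `class_weight="balanced"` stands in for the SMOTE step and the data is synthetic; the `GridSearchCV` objects over imblearn pipelines from the question should be usable in place of the searches built here, each with its own parameter grid.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class PerOutputClassifier(BaseEstimator):
    """Hypothetical wrapper: one independently fitted estimator per column of y."""

    def __init__(self, estimators):
        # List of unfitted estimators, one per output column (e.g. GridSearchCV objects)
        self.estimators = estimators

    def fit(self, X, y):
        y = np.asarray(y)
        if y.ndim != 2 or y.shape[1] != len(self.estimators):
            raise ValueError("y must be 2-D with one column per estimator")
        # Each clone only ever sees its own label column, so tuning stays separate
        self.estimators_ = [
            clone(est).fit(X, y[:, i]) for i, est in enumerate(self.estimators)
        ]
        return self

    def predict(self, X):
        # Stack the per-output predictions back into shape (n_samples, n_outputs)
        return np.column_stack([est.predict(X) for est in self.estimators_])


def make_search(param_grid):
    # Assumption: class_weight="balanced" is a stand-in for the oversampler step
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("logit", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    return GridSearchCV(pipe, param_grid, scoring="f1", cv=3)


# Synthetic data: two outputs with clearly different positive rates
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 5))
y = np.column_stack([
    (X[:, 0] > 1.3).astype(int),  # rarer positive class
    (X[:, 1] > 0.8).astype(int),  # more frequent positive class
])

multi = PerOutputClassifier([
    make_search({"logit__C": [1, 0.1, 0.01]}),  # grid for output 1
    make_search({"logit__C": [10, 1, 0.1]}),    # a different grid for output 2
])
multi.fit(X, y)  # one call fits and tunes both sub-pipelines
pred = multi.predict(X)
print(pred.shape)
print([search.best_params_ for search in multi.estimators_])
```

Because each search is fitted only on its own label column, resampling and hyperparameter selection stay independent per output, and the selected settings are available through `multi.estimators_[i].best_params_`.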