Put customized functions in Sklearn pipeline

Question

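In my classification scheme, there are several steps including:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalisation)
4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are the percentile (2.) and the hyperparameters for SVC (4.), and I want to tune them through grid search.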

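The current solution builds a "partial" pipeline including steps 3 and 4 of the scheme

clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))])

and breaks the scheme into two parts:

Tune the percentile of features to keep through the first grid search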

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile)
        # Feature indices select columns, not rows
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)

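The f1 scores are stored and then averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The purpose of putting the 'percentile for loop' as the inner loop is to allow a fair comparison, as we have the same training data (including synthesized data) across all fold partitions for all percentiles.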
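The aggregation step described above can be sketched as follows. This is only an illustration: the `f1_scores` values are made up (one row per fold, one column per percentile) and stand in for the scores collected inside the nested loop.

```python
import numpy as np

percentiles = [25, 50, 75]

# Hypothetical f1 scores: rows are folds, columns follow `percentiles`.
f1_scores = np.array([
    [0.70, 0.74, 0.72],
    [0.68, 0.75, 0.71],
    [0.71, 0.73, 0.74],
])

# Average over folds, then pick the percentile with the best mean score.
mean_f1 = f1_scores.mean(axis=0)
best_percentile = percentiles[int(np.argmax(mean_f1))]
print(best_percentile)
```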

After determining the percentile, tune the hyperparameters through a second grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # Select the features based on the tuned percentile
        selected_ind = Fisher(X_train, y_train, best_percentile)
        # Feature indices select columns, not rows
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)

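It is done in a very similar way, except that we tune the hyperparameters of the SVC rather than the percentile of features to select.

My questions are:

In the current solution, I only involve 3. and 4. in clf and do 1. and 2. "manually" in two nested loops as described above. Is there any way to include all four steps in a single pipeline and do the whole process at once?
If it is okay to keep the first nested loop, is it possible (and how) to simplify the next nested loop using a single pipeline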
```
clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal',preprocessing.StandardScaler()),
                    ('svc',svm.SVC(class_weight='auto'))]) 
```
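and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

Please note that both SMOTE and Fisher (ranking criterion) have to be applied only to the training data in each fold partition.

Any comment would be greatly appreciated.

SMOTE and Fisher are shown below: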

from numpy import shape, argsort, ceil

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    # Rank features from highest to lowest score
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    # Slice bounds must be integers
    ind_feature = sort_F[:int(ceil(n_feature))]
    return(ind_feature)
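
For reference, the score that Fscore computes for feature j can be written out as below. The notation is introduced here for illustration only: a bar denotes a mean, (+) and (-) mark the y==1 and y==0 samples, and n_+ and n_- are the class sizes.

```latex
F_j = \frac{\left(\bar{x}^{(+)}_j - \bar{x}_j\right)^2 + \left(\bar{x}^{(-)}_j - \bar{x}_j\right)^2}
           {\dfrac{\sigma^2_{+,j}}{n_+ - 1} + \dfrac{\sigma^2_{-,j}}{n_- - 1}},
\qquad
\sigma^2_{\pm,j} = \frac{1}{n_\pm} \sum_{k=1}^{n_\pm} \left(x^{(\pm)}_{k,j} - \bar{x}^{(\pm)}_j\right)^2
```

Note that X.var(axis=0) in NumPy uses ddof=0 by default, so each denominator term is the population variance divided by n - 1. If the intent is instead the sum of the per-class sample variances, as in the commonly used F-score definition, X_pos.var(axis=0, ddof=1) + X_neg.var(axis=0, ddof=1) would express that directly; which of the two is intended is not stated here.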

SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, which returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with their labels.

def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos) 
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return(X, y)

Answer 1

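I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. To do so, you will need to write a wrapper class around those functions. The easiest way is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If this doesn't make sense to you, post the details of at least one of your functions (the library it comes from, or your code if you wrote it yourself) and we can go from there.

EDIT:

Apologies, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.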

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
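
The ValueError above appears consistent with two details of the class as written: transform indexes rows (`x[self.ind_feature,:]`) rather than columns, and the grid passes percentiles on a 0 to 1 scale while fit divides by 100, so only one index is kept. A minimal corrected sketch follows, under some assumptions that are not in the original: it uses the current `sklearn.model_selection` import path, `class_weight="balanced"` in place of `'auto'`, a synthetic binary dataset from `make_classification` instead of iris, and the illustrative class name `FisherSelector`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FisherSelector(TransformerMixin, BaseEstimator):
    """Keep the top `percentile` percent of features ranked by F-score."""

    def __init__(self, percentile=50):
        self.percentile = percentile  # on a 0-100 scale

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        # Same denominator as the function in the question.
        deno = (X_pos.var(axis=0) / (len(X_pos) - 1)
                + X_neg.var(axis=0) / (len(X_neg) - 1))
        order = np.argsort(num / deno)[::-1]
        n_keep = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        self.ind_feature_ = order[:max(n_keep, 1)]
        return self

    def transform(self, X):
        # Select columns (features), not rows (samples).
        return np.asarray(X)[:, self.ind_feature_]


X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)

pipeline = Pipeline([
    ("fisher", FisherSelector()),
    ("normal", StandardScaler()),
    ("svm", SVC(class_weight="balanced")),
])
grid = {"fisher__percentile": [25, 50, 75], "svm__C": [1, 2]}

model = GridSearchCV(pipeline, param_grid=grid, cv=3, scoring="f1")
model.fit(X, y)
print(model.best_params_)
```

Because the selector is fitted inside the pipeline, GridSearchCV refits it on the training portion of every fold, so the feature ranking never sees the held-out samples. This sketch still leaves out SMOTE, since a plain sklearn Pipeline step cannot change y.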

Answer 2

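scikit-learn introduced FunctionTransformer as part of the preprocessing module in version 0.17. It can be used in a similar way to David's implementation of the class Fisher in the answer above, but with less flexibility. If the input/output of the function is configured properly, the transformer provides the fit/transform/fit_transform methods for the function and thus allows it to be used in a scikit pipeline.

For example, if the input to a pipeline is a series, the transformer would be as follows: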


def trans_func(input_series):
    return output_series

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)

sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)

where vect is a tf_idf transformer, clf is a classifier, and train is the training dataset. "train.desc" is the series text input to the pipeline.
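
A self-contained variant of the snippet above, as a rough sketch. The toy documents and tags, the clean-up performed by `trans_func`, and the choice of TfidfVectorizer and LogisticRegression are all stand-ins chosen for illustration; they are not the `tf_1k`, `clf_1k`, or `train` objects from the original.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def trans_func(texts):
    # Stateless, element-wise clean-up: lower-case and strip whitespace.
    return [t.lower().strip() for t in texts]


# Hypothetical toy data standing in for train.desc / train.tag.
docs = ["  Cheap WATCHES for sale ", "Meeting moved to 3pm",
        "WIN a free prize now", "Lunch tomorrow?",
        "Free prize, click NOW ", "Project meeting notes"]
tags = ["spam", "ham", "spam", "ham", "spam", "ham"]

sk_pipe = Pipeline([
    ("trans", FunctionTransformer(trans_func)),
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
sk_pipe.fit(docs, tags)
pred = sk_pipe.predict(["free watches, win now"])
print(pred)
```

FunctionTransformer is stateless: it learns nothing in fit and receives only X, which is why it fits simple clean-up steps but not a supervised selector such as Fisher, which needs y during fitting.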

Answer 3

You can in fact put all of these functions into a single pipeline!

In the accepted answer, @David wrote that your functions

transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have to do them prior as you originally were.

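It is true that sklearn's pipeline doesn't support this. However, imblearn's pipeline does. The imblearn pipeline is just like sklearn's, but it allows you to call transformations separately on the training and testing data via sample methods. Moreover, these sample methods are designed so that you can change both the data X and the labels y. This is relevant because you often want to include SMOTE in your pipeline but apply it only to the training data, not the testing data. With the imblearn pipeline, you can call SMOTE in the pipeline to transform just X_train and y_train and not X_test and y_test.

So you can create an imblearn pipeline that has a SMOTE sampler, a pre-processing step, and an SVC.

For more details, see the related Stack Overflow post and the Machine Learning Mastery article on this topic.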
