Does the imblearn pipeline turn off sampling for testing?
Let us assume the following code (from the imblearn example on pipelines):
...
# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
# Create the classifier
knn = KNN(1)
# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)
# Add one transformer and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)
I want to be sure that when pipeline.predict(X_test) is executed, the sampling procedures enn and renn are not executed (though of course pca must be executed).
- First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.
- I browsed through the imblearn Pipeline code but I could not find the predict method there.
- I also would like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.
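I just need some assurance that this is what happens with imblearn.Pipeline.
EDIT: 2020-08-28
@wundermahn's answer is all I needed.
This edit is just to add that this is why one should use imblearn.Pipeline for imbalanced pre-processing rather than sklearn.Pipeline. Nowhere in the imblearn documentation could I find an explanation of why imblearn.Pipeline is needed when sklearn.Pipeline already exists.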
Good questions. To go through them in the order you posted:
- First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the
test/validation set. Please correct me here if I am wrong.
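That is correct. You certainly do not want to test (whether on your test or your validation data) on a dataset that is not representative of the actual, live, "production" data. You really should apply resampling only to the training data. Note that if you use a technique such as k-fold cross-validation, the sampling should be applied separately within each fold, as shown in this IEEE paper.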
- I browsed through the imblearn Pipeline code but I could not find the predict method there.
I assume you found the imblearn.pipeline source code. If so, what you want to look at is the fit_predict method:
@if_delegate_has_method(delegate="_final_estimator")
def fit_predict(self, X, y=None, **fit_params):
    """Apply `fit_predict` of last step in pipeline after transforms.

    Applies fit_transforms of a pipeline to the data, followed by the
    fit_predict method of the final estimator in the pipeline. Valid
    only if the final estimator implements fit_predict.

    Parameters
    ----------
    X : iterable
        Training data. Must fulfill input requirements of first step of
        the pipeline.
    y : iterable, default=None
        Training targets. Must fulfill label requirements for all steps
        of the pipeline.
    **fit_params : dict of string -> object
        Parameters passed to the ``fit`` method of each step, where
        each parameter name is prefixed such that parameter ``p`` for step
        ``s`` has key ``s__p``.

    Returns
    -------
    y_pred : ndarray of shape (n_samples,)
        The predicted target.
    """
    Xt, yt, fit_params = self._fit(X, y, **fit_params)
    with _print_elapsed_time('Pipeline',
                             self._log_message(len(self.steps) - 1)):
        y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
    return y_pred
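Here we can see that the pipeline delegates to the final estimator. Note that fit_predict is a fit-time method, so the samplers do run inside self._fit. At prediction time, the pipeline's own predict method applies only the transform of the intermediate steps, skipping the samplers, and then calls the .predict method of the final estimator, which in the example you posted is scikit-learn's knn: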
def predict(self, X):
    """Predict the class labels for the provided data.

    Parameters
    ----------
    X : array-like of shape (n_queries, n_features), \
            or (n_queries, n_indexed) if metric == 'precomputed'
        Test samples.

    Returns
    -------
    y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
        Class labels for each data sample.
    """
    X = check_array(X, accept_sparse='csr')

    neigh_dist, neigh_ind = self.kneighbors(X)
    classes_ = self.classes_
    _y = self._y
    if not self.outputs_2d_:
        _y = self._y.reshape((-1, 1))
        classes_ = [self.classes_]

    n_outputs = len(classes_)
    n_queries = _num_samples(X)
    weights = _get_weights(neigh_dist, self.weights)

    y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
    for k, classes_k in enumerate(classes_):
        if weights is None:
            mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
        else:
            mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

        mode = np.asarray(mode.ravel(), dtype=np.intp)
        y_pred[:, k] = classes_k.take(mode)

    if not self.outputs_2d_:
        y_pred = y_pred.ravel()

    return y_pred
- I also would like to be sure that this correct behaviour works when the pipeline is inside a GridSearchCV
Assuming the two points above are correct, I take this to mean that you want a complete, minimal, reproducible example of this working in a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below:
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

# Keys are prefixed with the pipeline step name and a double underscore
param_grid = [
    {
        'classification__n_neighbors': [1, 3, 5, 7, 10],
    }
]

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)

# The sampler is an intermediate step; the classifier is the final estimator
pipe = Pipeline([
    ('sampling', SMOTE()),
    ('classification', KNeighborsClassifier())
])

grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)

mean_scores = np.array(grid.cv_results_['mean_test_score'])
print(mean_scores)
# Output of one run (no random_state is set, so values vary between runs):
# [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]
Your intuition was spot on, good job :)