Returning a scikit-learn object while using Joblib

I have a numpy array that I am transforming along the first axis with sklearn. I also want to store the transformer objects in a dictionary so I can reuse them later in the code. Here is my code:

from sklearn.preprocessing import QuantileTransformer

# fit one transformer per (i, j) pixel position and keep it for later reuse
scalers_dict = {}
for i in range(train_data_numpy.shape[1]):
    for j in range(train_data_numpy.shape[2]):
        scaler = QuantileTransformer(n_quantiles=60000, output_distribution='uniform')
        train_data_numpy[:,i,j] = scaler.fit_transform(train_data_numpy[:,i,j].reshape(-1,1)).reshape(-1)
        scalers_dict[(i,j)] = scaler
        

My train_data_numpy has shape (60000, 28, 28). The problem is that this takes a very long time to process (train_data_numpy is the MNIST dataset). I have a 16-core AMD Ryzen 5950X and I would like to parallelize this code.

I know that, for example, I could write something like this:

from joblib import Parallel, delayed

Parallel(n_jobs=16)(
    delayed(QuantileTransformer(n_quantiles=60000, output_distribution='uniform').fit_transform)(
        train_data_numpy[:, i, j].reshape(-1, 1)
    )
    for j in range(train_data_numpy.shape[2])
)

But this does not return the scaler objects, and I don't know how to accomplish this task with Joblib.
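For reference, joblib's Parallel simply collects the return value of each delayed call, so one rough sketch (the fit_one helper below is hypothetical, introduced only for illustration) would be to return both the fitted scaler and the transformed column from the worker and reassemble them afterwards:

from joblib import Parallel, delayed
from sklearn.preprocessing import QuantileTransformer

def fit_one(column):
    # hypothetical helper: fit a fresh transformer on a single (i, j) column
    scaler = QuantileTransformer(n_quantiles=60000, output_distribution='uniform')
    transformed = scaler.fit_transform(column.reshape(-1, 1)).reshape(-1)
    return scaler, transformed

results = Parallel(n_jobs=16)(
    delayed(fit_one)(train_data_numpy[:, i, j])
    for i in range(train_data_numpy.shape[1])
    for j in range(train_data_numpy.shape[2])
)

# results come back in generator order, so the (i, j) pairs can be rebuilt the same way
pairs = [(i, j)
         for i in range(train_data_numpy.shape[1])
         for j in range(train_data_numpy.shape[2])]
scalers_dict = {}
for (i, j), (scaler, transformed) in zip(pairs, results):
    scalers_dict[(i, j)] = scaler
    train_data_numpy[:, i, j] = transformed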

You can use Dask-ML, which is implemented on top of the Dask library and is compatible with scikit-learn.

Installation:

conda install -c conda-forge dask-ml

or

pip install dask-ml

Example

import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import QuantileTransformer as skQT
from dask_ml.preprocessing import QuantileTransformer as daskQT

# toy big dataset for testing
X, y = make_classification(n_samples=1000000, n_features=100, random_state=2021)

# Comparison

scaler = skQT()
start_ = time.time()
scaler.fit_transform(X)
end_ = time.time() - start_
print("No Parallelism -- Time Elapsed: {}".format(end_))


# Using Dask ML
scaler = daskQT()
start_ = time.time()
scaler.fit_transform(X)
end_ = time.time() - start_
print("With Parallelism -- Time Elapsed: {}".format(end_))

Results

No Parallelism -- Time Elapsed: 18.680
With Parallelism -- Time Elapsed: 2.982

My device specifications:

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

Number of Cores: 12
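To tie this back to your original loop: since the Dask-ML transformer exposes the same fit_transform API, a rough drop-in sketch (assuming the per-pixel, dictionary-of-scalers pattern from the question; np.asarray is used defensively in case the result comes back as a dask array rather than a numpy array) could look like this:

import numpy as np
from dask_ml.preprocessing import QuantileTransformer as daskQT

scalers_dict = {}
for i in range(train_data_numpy.shape[1]):
    for j in range(train_data_numpy.shape[2]):
        # same pattern as the original code; only the transformer class changes
        scaler = daskQT(n_quantiles=60000, output_distribution='uniform')
        column = train_data_numpy[:, i, j].reshape(-1, 1)
        transformed = np.asarray(scaler.fit_transform(column))
        train_data_numpy[:, i, j] = transformed.reshape(-1)
        scalers_dict[(i, j)] = scaler

Whether this beats the pure scikit-learn loop depends on how much Dask can parallelize a single 60000-element column, so it is worth benchmarking on your actual data.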