Returning a scikit-learn object while using Joblib
I have a numpy array that I am transforming along the first axis with sklearn. I also want to save the transformer objects in a dictionary so that I can use them later in the code.
Here is my code:
scalers_dict = {}
for i in range(train_data_numpy.shape[1]):
    for j in range(train_data_numpy.shape[2]):
        scaler = QuantileTransformer(n_quantiles=60000, output_distribution='uniform')
        train_data_numpy[:,i,j] = scaler.fit_transform(train_data_numpy[:,i,j].reshape(-1,1)).reshape(-1)
        scalers_dict[(i,j)] = scaler
The shape of my train_data_numpy is (60000, 28, 28). The problem is that this takes a very long time to run (train_data_numpy is the MNIST dataset). I have a 16-core AMD Ryzen 5950X and I would like to parallelize this code.
I know that, for example, I could write something like this:
Parallel(n_jobs=16)(delayed(QuantileTransformer(n_quantiles=60000, output_distribution='uniform').fit_transform)(train_data_numpy[:,i,j].reshape(-1,1)) for j in range(train_data_numpy.shape[2]))
but this does not return the scaler objects, and I don't know how to accomplish this task with Joblib.
You can use Dask-ML, which is implemented on top of the Dask library and is compatible with scikit-learn.
conda install -c conda-forge dask-ml
or
pip install dask-ml
Example
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import QuantileTransformer as skQT
from dask_ml.preprocessing import QuantileTransformer as daskQT
# toy big dataset for testing
X, y = make_classification(n_samples=1000000, n_features=100, random_state=2021)
# Comparison
scaler = skQT()
start_ = time.time()
scaler.fit_transform(X)
end_ = time.time() - start_
print("No Parallelism -- Time Elapsed: {}".format(end_))
# Using Dask ML
scaler = daskQT()
start_ = time.time()
scaler.fit_transform(X)
end_ = time.time() - start_
print("With Parallelism -- Time Elapsed: {}".format(end_))
Results
No Parallelism -- Time Elapsed: 18.680
With Parallelism -- Time Elapsed: 2.982
My machine specs:
Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Number of Cores: 12
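If you want to keep the per-pixel scalers with plain Joblib, a minimal sketch is to wrap the fit in a small helper that returns both the fitted scaler and the transformed column, so Parallel hands everything back to the parent process (fit_column is a hypothetical helper name, and train_data_numpy is assumed to be the (60000, 28, 28) array from the question):

from joblib import Parallel, delayed
from sklearn.preprocessing import QuantileTransformer

# Hypothetical helper: fit one scaler per (i, j) pixel and return the key,
# the fitted scaler, and the transformed column together.
def fit_column(column, i, j):
    scaler = QuantileTransformer(n_quantiles=60000, output_distribution='uniform')
    transformed = scaler.fit_transform(column.reshape(-1, 1)).reshape(-1)
    return (i, j), scaler, transformed

results = Parallel(n_jobs=16)(
    delayed(fit_column)(train_data_numpy[:, i, j], i, j)
    for i in range(train_data_numpy.shape[1])
    for j in range(train_data_numpy.shape[2])
)

# Reassemble the dictionary and write the transformed columns back.
scalers_dict = {}
for key, scaler, transformed in results:
    scalers_dict[key] = scaler
    train_data_numpy[:, key[0], key[1]] = transformed

Since QuantileTransformer scales each feature independently, another option worth considering is to reshape the data to (60000, 784) and fit a single transformer over all columns at once, which avoids keeping 784 separate objects.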