n_jobs for sklearn multioutput regressor with estimator=random forest regressor

When both the random forest estimator and the multioutput regressor itself accept `n_jobs`, how should it be used? For example, is it best not to specify `n_jobs` on the estimator and instead specify it on the multioutput regressor? A few examples are shown below:

# Imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# (1) No parallelization
rf_no_jobs = RandomForestRegressor()
multioutput_no_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs)

# (2) RF w/ parallelization, multioutput w/o parallelization
rf_with_jobs = RandomForestRegressor(n_jobs=-1)
multioutput_no_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs)

# (3) RF w/o parallelization, multioutput w parallelization
multioutput_with_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs, n_jobs=-1)

# (4) Both parallelized
multioutput_with_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs, n_jobs=-1)

Since RandomForestRegressor has 'native' multi-output support (no multioutput wrapper needed), I instead looked at KNeighborsRegressor and LightGBM, which both take an internal `n_jobs` argument, and I have the same question for them.
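As an aside, that native support means the wrapper can be skipped entirely for random forests; a minimal sketch with synthetic data (shapes and hyperparameters are illustrative, not from the benchmark):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.normal(size=(200, 3))  # 3 targets: no MultiOutputRegressor needed

# RandomForestRegressor accepts a 2D y directly and predicts all targets at once
rf = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0).fit(X, y)
print(rf.predict(X).shape)  # (200, 3)
```

Here `n_jobs` parallelizes over trees, so there is only one place to set it.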

Running on a Ryzen 5950X (Linux) and an Intel 11800H (Windows) with n_jobs = 8, I found consistent results:

  • With low Y dimensionality (e.g., 1–10 targets), it doesn't matter much where `n_jobs` goes; it finishes quickly either way. Initializing multiprocessing carries roughly 1 second of overhead, but joblib reuses an existing pool by default, which speeds things up.
  • With high dimensionality (say > 20), putting `n_jobs` only on MultiOutputRegressor, with KNN receiving `n_jobs=1`, is about 10x faster at 160 dimensions/targets.
  • Using `with joblib.parallel_backend("loky", n_jobs=your_n_jobs):` is just as fast and conveniently sets `n_jobs` for all sklearn estimators inside the block. This is the simple choice.
  • RegressorChain is fast enough at low dimensionality, but absurdly slow with KNeighbors at 160 dimensions (500x slower than MultiOutputRegressor); if I wanted RegressorChain, I'd stick with LightGBM, which handles it far better.
  • With LightGBM and MultiOutputRegressor, setting `n_jobs` only on the wrapper was again faster than the internal `n_jobs`, but the gap was much smaller (about 3x on the 5950X under Linux, only about 1.2x on the 11800H under Windows).

Since the full code is a bit long, here is a partial example containing most of it:

from timeit import default_timer as timer
import numpy as np
from joblib import parallel_backend
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.datasets import fetch_california_housing

# adjust n_jobs to the number of physical CPU cores on your machine or pass -1 for auto max
n_jobs = 8
knn_model_param_dict = {}  # kwargs if desired
num_y_dims = 160

X, y_one_dim = fetch_california_housing(return_X_y=True)
y_one_dim = y_one_dim.reshape(-1, 1)
# extra multioutput dims generated randomly
dims = [y_one_dim]
for _ in range(num_y_dims - 1):
    dims.append(np.random.gamma(y_one_dim.std(), size=y_one_dim.shape))
y = np.concatenate(dims, axis=1)


# Warm-up fit so worker-pool startup cost doesn't skew the first timed trial
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict),
    n_jobs=n_jobs,
).fit(X, y)

trial = "KNN with all n_jobs=1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner model with n_jobs"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=n_jobs),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN outer multioutput with n_jobs, inner with 1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=n_jobs,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner and outer both -1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=-1),
    n_jobs=-1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "joblib backend chooses"
start = timer()
with parallel_backend("loky", n_jobs=n_jobs):
    regr = MultiOutputRegressor(
        KNeighborsRegressor(**knn_model_param_dict),
    )
    regr.fit(X, y)
    regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
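The RegressorChain import above is used in the part of the code not shown; a trial for it follows the same pattern. Below is a self-contained sketch using a small synthetic problem so it finishes quickly (the 160-dimension benchmark case is the one that was 500x slower):

```python
from timeit import default_timer as timer
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.normal(size=(500, 5))  # small target count so the chain stays fast

trial = "KNN wrapped in RegressorChain"
start = timer()
# Each link in the chain appends the previous predictions as extra features,
# so the fits are inherently sequential per target
chain = RegressorChain(KNeighborsRegressor(n_jobs=1))
chain.fit(X, y)
pred = chain.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
```

The sequential feature-augmentation is why the chain scales so much worse with target count than the independent per-target fits of MultiOutputRegressor.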