n_jobs for sklearn multioutput regressor with estimator=random forest regressor
How should :param n_jobs: be used when both the random forest estimator for the multioutput regressor and the multioutput regressor itself accept it? For example, is it better to leave n_jobs unspecified for the estimator and specify it only for the multioutput regressor? A few configurations are shown below:
# Imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
# (1) No parallelization
rf_no_jobs = RandomForestRegressor()
multioutput_no_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs)
# (2) RF w/ parallelization, multioutput w/o parallelization
rf_with_jobs = RandomForestRegressor(n_jobs=-1)
multioutput_no_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs)
# (3) RF w/o parallelization, multioutput w parallelization
multioutput_with_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs, n_jobs=-1)
# (4) Both parallelized
multioutput_with_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs, n_jobs=-1)
Since RandomForestRegressor has 'native' multi-output support (no multioutput wrapper needed), I instead looked at KNeighborsRegressor and LightGBM, which both take an internal n_jobs argument, and asked the same question for them.
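For reference, a minimal sketch of that native path (synthetic data, purely illustrative): RandomForestRegressor accepts a 2-D y directly, so the forest's own n_jobs is the only parallelism knob in play.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# native multi-output: fit a 2-D y directly, no wrapper involved,
# so the forest's n_jobs is the only parallelism setting
X_demo = np.random.rand(200, 4)
y_demo = np.random.rand(200, 3)  # 3 targets
rf_native = RandomForestRegressor(n_jobs=-1).fit(X_demo, y_demo)
print(rf_native.predict(X_demo).shape)  # (200, 3)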
Running with n_jobs = 8 on a Ryzen 5950X (Linux) and an Intel 11800H (Windows), I found consistent results:
- With low Y dimensionality (say, 1-10 targets) it doesn't matter much where n_jobs goes; it finishes quickly either way. Initializing multiprocessing costs roughly 1 second of overhead, but joblib reuses an existing pool by default, which speeds things up.
- With high dimensionality (say > 20), putting n_jobs only on the MultiOutputRegressor, with the KNN receiving n_jobs=1, was about 10x faster at 160 dimensions/targets.
- Using with joblib.parallel_backend("loky", n_jobs=your_n_jobs): is just as fast and conveniently sets n_jobs for all the sklearn estimators inside the block. It is the easy choice.
- RegressorChain is fast enough at low dimensionality but absurdly slow (500x slower than MultiOutputRegressor) with KNeighbors at 160 dimensions; if you need a chain, I would stick with LightGBM, which handles RegressorChain much better. A sketch of the chain follows this list.
- With LightGBM, setting n_jobs only on the MultiOutputRegressor was again faster than the internal n_jobs, but by a much smaller margin (about 3x on the 5950X/Linux, only about 1.2x on the 11800H/Windows); a LightGBM variant of the timing pattern is sketched after the benchmark code below.
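As referenced above, a minimal sketch of RegressorChain usage (synthetic data, illustrative only). The chain fits one model per target sequentially, feeding earlier targets' predictions in as extra features, which is why it cannot fan targets out across processes the way MultiOutputRegressor(n_jobs=...) can.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import RegressorChain

# RegressorChain trains target-by-target, each model seeing the previous
# targets' predictions as extra input features, so there is no across-target
# parallelism to hand out; expect it to lag far behind MultiOutputRegressor
# at high target counts.
X_demo = np.random.rand(500, 8)
y_demo = np.random.rand(500, 5)  # keep the target count small here
chain = RegressorChain(KNeighborsRegressor()).fit(X_demo, y_demo)
print(chain.predict(X_demo).shape)  # (500, 5)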
Since the full code is a bit long, here is a partial example containing most of it:
from timeit import default_timer as timer
import numpy as np
from joblib import parallel_backend
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.datasets import fetch_california_housing

# adjust n_jobs to the number of physical CPU cores on your machine, or pass -1 for auto max
n_jobs = 8
knn_model_param_dict = {}  # kwargs if desired
num_y_dims = 160

X, y_one_dim = fetch_california_housing(return_X_y=True)
y_one_dim = y_one_dim.reshape(-1, 1)

# extra multioutput dims generated randomly
dims = [y_one_dim]
for _ in range(num_y_dims - 1):
    dims.append(np.random.gamma(y_one_dim.std(), size=y_one_dim.shape))
y = np.concatenate(dims, axis=1)

# INIT: warm-up fit, so the ~1 s multiprocessing pool startup cost
# is not charged to any of the timed trials below
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict),
    n_jobs=n_jobs,
).fit(X, y)

trial = "KNN with all n_jobs=1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner model with n_jobs"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=n_jobs),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN outer multioutput with n_jobs, inner with 1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=n_jobs,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner and outer both -1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=-1),
    n_jobs=-1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "joblib backend chooses"
start = timer()
with parallel_backend("loky", n_jobs=n_jobs):
    regr = MultiOutputRegressor(
        KNeighborsRegressor(**knn_model_param_dict),
    )
    regr.fit(X, y)
    regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
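For completeness, the LightGBM comparison follows the same trial pattern; a minimal sketch, assuming the lightgbm package is installed (its LGBMRegressor takes an n_jobs argument just as KNN does) and reusing X, y, and n_jobs from the script above:

from lightgbm import LGBMRegressor

# per the results above: give the parallelism to the outer wrapper
# and pin the booster itself to a single thread
trial = "LightGBM outer multioutput with n_jobs, inner with 1"
start = timer()
regr = MultiOutputRegressor(
    LGBMRegressor(n_jobs=1),
    n_jobs=n_jobs,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")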