Parallelize RandomizedSearchCV to restrict the number of CPUs used
When I fit a model with sklearn's RandomizedSearchCV, I am trying to limit the number of CPUs used, but somehow all CPUs keep being used. Following the answer to Python scikit learn n_jobs, I saw that in scikit-learn we can use n_jobs to control the number of CPU cores used.
n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.
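As a quick sanity check of that formula, joblib itself can resolve a negative n_jobs into the actual worker count on the current machine via effective_n_jobs (a minimal sketch, assuming joblib is installed; the printed values correspond to an 8-core box):

from joblib import cpu_count, effective_n_jobs

print(cpu_count())           # e.g. 8 -> total CPUs joblib can see
print(effective_n_jobs(-1))  # 8      -> all CPUs
print(effective_n_jobs(-2))  # 7      -> n_cpus + 1 + (-2)
print(effective_n_jobs(-5))  # 4      -> n_cpus + 1 + (-5)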
But when setting n_jobs to -5, all CPUs still keep running at 100%. I looked into the joblib library, to use Parallel and delayed, but all my CPUs still keep being used. Here is what I tried:
from sklearn.model_selection import RandomizedSearchCV
from joblib import Parallel,delayed
def rscv_l(model, param_grid, X_train, y_train):
    rs_model = RandomizedSearchCV(model, param_grid, n_iter=10,
                                  n_jobs=-5, verbose=2, cv=5,
                                  scoring='r2')
    rs_model.fit(X_train, y_train)  # the cpu usage problem comes here
    return rs_model

# Here my attempt to parallelize and set my function as iterable
results = Parallel(n_jobs=-5)(delayed(rscv_l)(model,
                                              param_grid,
                                              X, y)
                              for X, y in zip([X_train],
                                              [y_train]))
What is going wrong?
UPDATE:
Looking at How do you stop numpy from multithreading?, I think I may have a multithreading problem. When I check my numpy configuration, I find:
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['user/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['user/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['user/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['user/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['user/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['user/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['user/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['user/include']
But the suggested solution still does not work for me:
import os
os.environ["OMP_NUM_THREADS"] = "4" # export OMP_NUM_THREADS=4
os.environ["OPENBLAS_NUM_THREADS"] = "4" # export OPENBLAS_NUM_THREADS=4
os.environ["MKL_NUM_THREADS"] = "6" # export MKL_NUM_THREADS=6
os.environ["VECLIB_MAXIMUM_THREADS"] = "4" # export VECLIB_MAXIMUM_THREADS=4
os.environ["NUMEXPR_NUM_THREADS"] = "6" # export NUMEXPR_NUM_THREADS=6
import numpy
from sklearn.model_selection import RandomizedSearchCV
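(A likely reason this has no effect is that these environment variables are only read when MKL/OpenBLAS initialises its thread pool, i.e. they must be set before numpy, or anything that imports it, is loaded for the first time in the process. A runtime alternative is the threadpoolctl package; a minimal sketch, assuming threadpoolctl is installed, which is not something used in the original post:)

import numpy as np
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=2):   # cap BLAS/OpenMP threads inside this block
    a = np.random.rand(2000, 2000)
    _ = a @ a                       # heavy BLAS call, now limited to 2 threads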
THIS SOLVED MY PROBLEM:
Thanks to the answer of @user3666197, I decided to limit the number of CPUs available to the whole script and to simply use n_jobs with a positive integer. This solved my CPU usage problem:
import os
n_jobs = 2 # The number of tasks to run in parallel
n_cpus = 2 # Number of CPUs assigned to this process
pid = os.getpid()
print("PID: %i" % pid)
# Control which CPUs are made available for this script
cpu_arg = ''.join([str(ci) + ',' for ci in list(range(n_cpus))])[:-1]
cmd = 'taskset -cp %s %i' % (cpu_arg, pid)
print("executing command '%s' ..." % cmd)
os.system(cmd)
# hyperparameter tuning
rs_model = RandomizedSearchCV(xgb, param_grid, n_iter=10,
n_jobs=n_jobs, verbose=2, cv= n_folds,
scoring='r2')
# model fitting
rs_model.fit(X_train,y_train)
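For reference, on Linux the same pinning can be done from inside Python, without shelling out to taskset, via os.sched_setaffinity (a sketch of the same idea, restricting the current process to the first n_cpus cores; not part of the original solution):

import os

n_cpus = 2
os.sched_setaffinity(0, set(range(n_cpus)))      # pid 0 = this process; allow CPUs 0..n_cpus-1 only
print("allowed CPUs:", os.sched_getaffinity(0))  # verify the affinity mask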
Q : " What is going wrong? "
A :
There is no single thing we could point to and say "it went wrong"; the code-execution eco-system is so multi-layered that it is not as trivial as we would like it to be, and there are several (different, some of them hidden) places where configuration decides how many CPU-cores will actually carry the overall processing load.

The situation is also version-dependent and configuration-specific (scikit-learn, numpy and scipy have mutual dependencies, plus latent dependencies on the respective compilation options of the numerical packages used).
AN EXPERIMENT
to prove - or refute - the effect of the syntax assumed just above:

Given the documented behaviour of how negative numbers are interpreted in the top-level n_jobs parameter of the RandomizedSearchCV(...) method, submit exactly the same task, yet configured so that it gets an explicit, permitted (top-level) amount of n_jobs = CPU_cores_allowed_to_load, and observe when and how many cores actually do get loaded throughout the whole flow of processing.
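One simple way to make that observation from the same machine is to sample the per-core load with psutil while the search is running (a minimal sketch, assuming psutil is available; it is not part of the original answer):

import psutil

# per-core utilisation in percent, sampled over one second
print(psutil.cpu_percent(interval=1.0, percpu=True))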
THE RESULTS:
If and only if that "permitted" amount of CPU-cores got loaded, the top-level call has correctly "propagated" the parameter setting into each and every method or procedure used along the flow of processing.

If your observation proves the setting was not "obeyed", we can only review the complete extent of all the source-code verticals to decide who is to blame for such dis-obedience of not keeping the work within the ceiling set at the top level for n_jobs. While O/S tools for CPU-core affinity mapping may give us some chance to restrict "from the outside" the number of cores used, other adverse effects (the add-on management costs being the least performance-punishing ones) will appear. Thermal management introduces CPU-core "hopping", which an affinity map does not permit: on contemporary processors this leads to a more and more reduced clock-frequency (as the cores indeed get hot during numerically intensive processing), thus prolonging the overall task processing time, because there are "cooler" (and thus faster) CPU-cores in the system, namely those the affinity mapping prevents from being used, yet these are the very CPU-cores the processing could temporarily be placed on, while the hot ones, having hit their thermal ceilings, would get a chance to cool down and re-gain the chance to run at not-reduced CPU clock rates.
The top-level call might have set an n_jobs parameter, yet any lower-level component may have "obeyed" that value on its own, without knowing how many other, concurrently working peers did the same - as joblib.Parallel() and similar constructors do, not to mention the other, inherently deployed, GIL-evading multithreading libraries - since there happens to be no mutual coordination that would keep the top-level n_jobs ceiling.
def rscv_l( model, param_grid, X_train, y_train ):
    rs_model = RandomizedSearchCV( model,
                                   param_grid,
                                   n_iter  = 10,
                                   n_jobs  =  1,  # DO NOT CANNIBALISE MORE
                                   verbose =  2,  # AS BEING RUN
                                   cv      =  5,  # IN CONFLICT
                                   scoring = 'r2' # WITH OUTER-SETTINGS
                                   )              # ----vvv----------
    rs_model.fit( X_train, y_train )              # the cpu usage problem comes here
    return rs_model
################################################################
#
# Here my attempt to parallelize and set my function as iterable
#
results = Parallel( n_jobs = -5              # <------- joblib spawns that many workers
                    )( delayed( rscv_l )     # <---# HERE, avoid
                              ( model,       #      UNCOORDINATED
                                param_grid,  #      CPU-CANNIBALISM
                                X, y )       #      ref. above
                       for X, y in zip( [X_train],
                                        [y_train] )
                       )
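For completeness, the total number of threads competing for the CPUs is roughly the product of the outer joblib workers, the inner n_jobs used by RandomizedSearchCV, and the BLAS/OpenMP threads spawned inside each fit - which is how an apparently restrictive n_jobs can still saturate every core. A back-of-the-envelope check (the numbers are illustrative, not taken from the original post):

outer_workers     = 4   # what joblib resolves n_jobs = -5 to on an 8-core machine
inner_search_jobs = 1   # n_jobs inside RandomizedSearchCV, as recommended above
blas_threads      = 8   # MKL/OpenBLAS default: one thread per core, unless capped
print( outer_workers * inner_search_jobs * blas_threads )   # ~32 runnable threads on 8 cores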
IF INTERESTED IN FURTHER DETAILS
you may also like this, this, "How does scikit-learn handle..." and further posts from other sources covering this problem.