What batch_size and pre_dispatch in joblib exactly mean
From the documentation here https://pythonhosted.org/joblib/parallel.html#parallel-reference-documentation it is not clear to me what batch_size and pre_dispatch exactly mean.

Let's consider the case where we use the 'multiprocessing' backend, 2 jobs (2 processes), and we have 10 tasks to compute.

As I understand it:

batch_size - controls the number of tasks pickled at a time, so if you set batch_size = 5, joblib will pickle and send 5 tasks to each process at once, and after arriving there they will be solved sequentially, one after another. With batch_size=1, joblib will pickle and send one task at a time, and only once that process has finished the previous task.
To illustrate what I mean:
def solve_one_task(task):
    # Solves one task at a time
    ...
    return result

def solve_list(list_of_tasks):
    # Solves a batch of tasks sequentially
    return [solve_one_task(task) for task in list_of_tasks]
So this code:
Parallel(n_jobs=2, backend='multiprocessing', batch_size=5)(
    delayed(solve_one_task)(task) for task in tasks)
should be equivalent (in performance) to this code:

slices = [(0, 5), (5, 10)]
Parallel(n_jobs=2, backend='multiprocessing', batch_size=1)(
    delayed(solve_list)(tasks[slice[0]:slice[1]]) for slice in slices)
Am I right? And what does pre_dispatch mean then?
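The slicing in the second snippet can be sketched with a small, self-contained chunking helper (make_batches is a hypothetical name for illustration, not a joblib API):

```python
def make_batches(tasks, batch_size):
    # Split tasks into consecutive chunks of batch_size elements,
    # mimicking how joblib groups tasks before pickling them.
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]

tasks = list(range(10))
print(make_batches(tasks, 5))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

With 10 tasks and batch_size=5, each of the 2 processes receives one chunk of 5 tasks and solves it sequentially.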
It turned out I was right: both pieces of code performed very similarly, so batch_size works just as I expected in the question. pre_dispatch (as the documentation states) controls the number of instantiated tasks in the task queue.
from sklearn.externals.joblib import Parallel, delayed
from time import sleep, time

def solve_one_task(task):
    # Solves one task at a time
    print("%d. Task #%d is being solved" % (time(), task))
    sleep(5)
    return task

def task_gen(max_task):
    current_task = 0
    while current_task < max_task:
        print("%d. Task #%d was dispatched" % (time(), current_task))
        yield current_task
        current_task += 1

Parallel(n_jobs=2, backend='multiprocessing', batch_size=1, pre_dispatch=3)(
    delayed(solve_one_task)(task) for task in task_gen(10))
Output:
1450105367. Task #0 was dispatched
1450105367. Task #1 was dispatched
1450105367. Task #2 was dispatched
1450105367. Task #0 is being solved
1450105367. Task #1 is being solved
1450105372. Task #2 is being solved
1450105372. Task #3 was dispatched
1450105372. Task #4 was dispatched
1450105372. Task #3 is being solved
1450105377. Task #4 is being solved
1450105377. Task #5 was dispatched
1450105377. Task #5 is being solved
1450105377. Task #6 was dispatched
1450105382. Task #7 was dispatched
1450105382. Task #6 is being solved
1450105382. Task #7 is being solved
1450105382. Task #8 was dispatched
1450105387. Task #9 was dispatched
1450105387. Task #8 is being solved
1450105387. Task #9 is being solved
Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
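The output confirms the documented behavior: only pre_dispatch tasks are pulled from the generator up front, and another is pulled each time a worker frees up. The up-front part of that can be sketched with plain itertools, no joblib involved (a simplified model, not how joblib is implemented internally):

```python
from itertools import islice

def task_gen(max_task):
    # Same generator as above: each yield is one "instantiated" task
    for current_task in range(max_task):
        print("Task #%d was dispatched" % current_task)
        yield current_task

gen = task_gen(10)
# With pre_dispatch=3, joblib would instantiate only 3 tasks up front;
# the remaining 7 stay un-created inside the generator until needed.
prefetched = list(islice(gen, 3))
print(prefetched)  # [0, 1, 2]
```

This is why pre_dispatch matters for memory: with a generator that produces large task arguments, a small pre_dispatch keeps only a few of them materialized at any moment.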