使用 dask.bag 与普通 python 列表?
Using dask.bag vs normal python list?
当我 运行 下面这个并行 dask.bag
代码时,我的计算速度似乎比顺序 Python 代码慢得多。对原因有何见解?
import dask.bag as db
def is_even(x):
return not x % 2
任务代码:
%%timeit
b = db.from_sequence(range(2000000))
c = b.filter(is_even).map(lambda x: x ** 2)
c.compute()
>>> 12.8 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 50.7 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Python代码:
%%timeit
b = list(range(2000000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 2, b))
>>> 547 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 2.25 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
感谢@abarnert 提出通过更长的任务长度查看开销的建议。
似乎每个任务的长度都太短了,开销让 Dask 变慢了。我将指数从 2
更改为 10000
以使每个任务更长。这个例子产生了我所期望的结果:
Python代码:
%%timeit
b = list(range(50000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 10000, b))
>>> 34.8 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
任务代码:
%%timeit
b = db.from_sequence(range(50000))
c = b.filter(is_even).map(lambda x: x ** 10000)
c.compute()
>>> 26.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
当我 运行 下面这个并行 dask.bag
代码时,我的计算速度似乎比顺序 Python 代码慢得多。对原因有何见解?
import dask.bag as db
def is_even(x):
return not x % 2
任务代码:
%%timeit
b = db.from_sequence(range(2000000))
c = b.filter(is_even).map(lambda x: x ** 2)
c.compute()
>>> 12.8 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 50.7 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Python代码:
%%timeit
b = list(range(2000000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 2, b))
>>> 547 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 2.25 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
感谢@abarnert 提出通过更长的任务长度查看开销的建议。
似乎每个任务的长度都太短了,开销让 Dask 变慢了。我将指数从 2
更改为 10000
以使每个任务更长。这个例子产生了我所期望的结果:
Python代码:
%%timeit
b = list(range(50000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 10000, b))
>>> 34.8 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
任务代码:
%%timeit
b = db.from_sequence(range(50000))
c = b.filter(is_even).map(lambda x: x ** 10000)
c.compute()
>>> 26.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)