(Dask) How to distribute an expensive resource needed for computation?
What is the best way to distribute tasks over a dataset when the computation needs a resource or object that is relatively expensive to create?
```python
# in pandas
import pandas as pd

df = pd.read_csv(...)
foo = Foo()  # expensive initialization
result = df.apply(lambda x: foo.do(x))

# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
```
I plan to use this with dask_jobqueue and an SGECluster.
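For reference, a minimal dask_jobqueue setup along those lines; the queue name and resource figures below are placeholders, not values from the question:

```python
from dask.distributed import Client
from dask_jobqueue import SGECluster

# Placeholder resources; tune to the actual SGE queue.
cluster = SGECluster(queue="default.q", cores=4, memory="8GB")
cluster.scale(jobs=10)  # ask SGE for 10 worker jobs

client = Client(cluster)
```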
```python
import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
```
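Putting the answer together, a minimal end-to-end sketch. `Foo`, its methods, and the file pattern are illustrative stand-ins; `axis=1` and `meta=` are added because dask's row-wise `apply` needs them, and the pattern assumes dask resolves the delayed `foo` when it is passed through `apply`, which is what the answer relies on:

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client(cluster)  # the SGECluster from above

class Foo:
    def __init__(self):
        ...  # expensive setup: load a model, open a connection, etc.

    def do(self, row):
        return row.sum()  # stand-in for the real per-row computation

foo = dask.delayed(Foo)()  # built once, on a worker, not locally

def do(row, foo):
    return foo.do(row)

df = dd.read_csv("data-*.csv")
result = df.apply(do, foo=foo, axis=1, meta=("result", "f8"))
print(result.compute())
```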
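As for the `client.scatter` idea in the question: it can also work when the object is cheap to serialize, though this sketch assumes the distributed scheduler resolves the future when the graph runs. Continuing with `do`, `df`, and `client` from the sketch above:

```python
foo = Foo()  # built locally this time
foo_future = client.scatter(foo, broadcast=True)  # copy to every worker

result = df.apply(do, foo=foo_future, axis=1, meta=("result", "f8"))
```

The trade-off: `client.scatter` builds `Foo` on the client and ships its bytes to the workers, while `dask.delayed(Foo)()` builds it on a worker, which is what the answer recommends when initialization, not serialization, is the expensive part.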