Multiple images mean dask.delayed vs. dask.array
Background
I have a list containing the paths of thousands of image stacks (3D numpy arrays) that have already been preprocessed and saved as .npy binary files.
Case study
I want to calculate the average of all the images, and in order to speed up the analysis I want to process them in parallel.
Approach using dask.delayed
# List with the file names
flist_img_to_filter

# I chunk the list of paths into sublists. The number of chunks corresponds to
# the number of cores used for the analysis
chunked_list

# Scatter the sublists of paths so they can be processed in parallel
futures = client.scatter(chunked_list)

# Create the dask processing graph
output = []
for future in futures:
    ImgMean = delayed(partial_image_mean)(future)
    output.append(ImgMean)
ImgMean_all = delayed(sum)(output)
ImgMean_all = ImgMean_all / len(futures)

# Compute the graph
ImgMean = ImgMean_all.compute()
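partial_image_mean is not defined in the question; a minimal sketch of what such a helper might look like is shown below (the implementation is an assumption, not the author's actual code):

import numpy as np

def partial_image_mean(path_sublist):
    # Hypothetical helper: load each .npy image stack in the sublist and
    # return the element-wise mean of that sublist
    total = None
    for path in path_sublist:
        img = np.load(path).astype(np.float64)
        total = img if total is None else total + img
    return total / len(path_sublist)

Note that summing the partial means and dividing by the number of chunks, as in the graph above, is only exact when all sublists have the same length.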
Approach using dask.arrays
Modified from Matthew Rocklin's blog
imread = delayed(np.load, pure=True)  # Lazy version of imread

# Lazily evaluate imread on each path
lazy_values = [imread(img_path) for img_path in flist_img_to_filter]
arrays = [da.from_delayed(lazy_value, dtype=np.uint16, shape=shape)
          for lazy_value in lazy_values]

# Stack all small Dask arrays into one
stack = da.stack(arrays, axis=0)
ImgMean = stack.mean(axis=0).compute()
Questions
1. In the dask.delayed approach, is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?
2. The dask.arrays approach is significantly slower and has higher memory usage. Is this a 'bad way' to use dask.arrays?
3. Is there a better way to approach the issue?
Thanks!
In the dask.delayed approach is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?
The simple answer is no; as of Dask version 0.15.4 there is no very robust way to submit a computation on "all of the tasks of a certain type currently present on this worker".
However, you can easily ask the scheduler which keys are present on which workers with the who_has or has_what client methods.
import dask
from dask.distributed import wait

futures = dask.persist(futures)
wait(futures)
client.who_has(futures)
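has_what gives the inverse view of the same information; a minimal usage sketch:

# Inverse view: worker address -> list of keys that worker currently holds
client.has_what()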
The dask.arrays approach is significantly slower and with higher memory usage. Is this a 'bad way' to use dask.arrays?
You may want to use the split_every= keyword of the mean function, or rechunk your array to group images together before calling mean (probably similar to what you do above), in order to play with the parallelism/memory tradeoff.
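A minimal sketch of both options, assuming the stack array built above; the values split_every=4 and 10 images per chunk are arbitrary examples, not recommendations:

# Option 1: cap how many input chunks are combined per reduction step
ImgMean = stack.mean(axis=0, split_every=4).compute()

# Option 2: rechunk so several images share one chunk before reducing
stack_grouped = stack.rechunk({0: 10})  # 10 images per chunk along the stacking axis
ImgMean = stack_grouped.mean(axis=0).compute()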
Is there a better way to approach the issue?
You could also try as_completed for this and compute running means as the data completes. You would have to switch from delayed to futures.
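A minimal sketch of that idea, reusing chunked_list from the question and the hypothetical partial_image_mean helper sketched above; client.submit returns futures, and as_completed yields them as they finish:

from dask.distributed import as_completed

# Submit one task per sublist of paths; submit returns futures instead of delayed objects
futures = [client.submit(partial_image_mean, sublist) for sublist in chunked_list]

# Accumulate the running mean as results arrive
running_sum = None
n_done = 0
for future in as_completed(futures):
    partial_mean = future.result()
    running_sum = partial_mean if running_sum is None else running_sum + partial_mean
    n_done += 1
ImgMean = running_sum / n_done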