创建分布式任务数组

Question

我有兴趣用我身边的一堆 netcdf 文件制作一个分布式 dask 数组。我沿着 "Distributed Dask arrays" 中概述的路径开始，但对 'distributed.collections'

的弃用感到有点困惑

现在创建分布式 dask 数组的最佳方法是什么？我有我的 dask-scheduler 和 dask-worker 任务运行。我可以成功执行以下操作：

from distributed import Client, progress
client = Client('scheduler-address:8786')
futures = client.map(load_netcdf, filenames)
progress(futures)

下一步是什么？

Answer 1

首先，如果您有很多 NetCDF 文件，那么您应该仔细研究 XArray 包，它包含 Dask.array 并管理所有 NetCDF 元数据约定。

特别是我认为您需要 open_mfdataset 函数。

如果您想使用该博文中的技术手动构建 dask.array，那么您应该使用 dask.delayed interface and the da.from_delayed 函数。

如果您想像该博文中那样使用 Futures，那很好，da.from_delayed 将接受 Futures 代替延迟值。

array_chunks = [da.from_delayed(future, shape=..., dtype=...) 
                for future in futures]
array = da.concatenate(array_chunks, axis=0)

Creating distributed dask arrays