Multiple concurrent get_dataset calls in a dask worker

TL;DR
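If several queries arrive while another query is still downloading the dataset they all need, will Dask try to download the dataset multiple times? Or will it recognize that the dataset is "in flight" and automatically wait for that download to finish?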

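Background
Suppose I have a freshly started worker (no datasets loaded into memory yet) and my function asks for a dataset. The dataset is then downloaded onto the worker on demand. A simple scenario: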

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Executes query

But what if I run into the following situation:

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Receives query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(5) Receives another query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(6) Execute queries

Will Dask try to download the dataset multiple times, or will it recognize that it is "in flight" and automatically wait for it to finish?

I have read the source code, but the dataset publish/list machinery is still a black box to me.

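Each call to client.get_dataset is independent, so multiple requests will result in redundant work. That said, you should not store anything in a dataset other than metadata (for example, a dask collection that points to remote futures), so if used correctly this download should take only a few milliseconds.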
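
If the redundant calls still matter in your setup, one option is to deduplicate them yourself inside the worker process. The following is a minimal sketch, not something Dask provides: `InFlightCache` and its `loader` argument are hypothetical names introduced here for illustration. It assumes the queries run as threads in the same worker process, so that a lock and a shared dictionary are visible to all of them. The first caller for a given name performs the load; later callers for the same name wait on the same result instead of starting a second load.

```python
import threading
from concurrent.futures import Future


class InFlightCache:
    """Hypothetical helper: share one load per key among concurrent callers."""

    def __init__(self, loader):
        # loader is an assumed callable taking a dataset name and returning
        # the dataset, e.g. a thin wrapper around client.get_dataset(name).
        self._loader = loader
        self._lock = threading.Lock()
        self._futures = {}  # name -> Future holding the loaded value

    def get(self, name):
        # Decide under the lock whether this caller owns the load.
        with self._lock:
            fut = self._futures.get(name)
            owner = fut is None
            if owner:
                fut = Future()
                self._futures[name] = fut

        if owner:
            try:
                fut.set_result(self._loader(name))
            except BaseException as exc:
                # Forget the failed entry so a later call can retry.
                with self._lock:
                    self._futures.pop(name, None)
                fut.set_exception(exc)

        # The owner returns immediately; everyone else blocks until the
        # owner has finished (or re-raises the owner's exception).
        return fut.result()
```

In a task, the loader could plausibly be something like `lambda name: get_client().get_dataset(name)`, with a single module-level `InFlightCache` instance per worker process; that wiring is an assumption and is not shown here. Results stay cached for the lifetime of the process, so this only makes sense when the published dataset does not change under the same name.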