在 HPC 集群上使用 dask 分配大量作业的策略

Question

我有一个相当复杂的 python 算法，我需要在 HPC 集群中分布。

代码是运行来自具有 60 GB 内存的 Jupyterhub 实例。 PBS集群的配置是1进程，1核，每个worker 30Gb，nanny=False（否则计算不会运行）总共26个worker（总内存约726GB）

我不需要取回任何数据，因为需要的数据会在计算结束时立即写入磁盘。请注意，当运行独立时，每个计算大约需要 7 分钟。

我运行遇到的问题如下：每个独立工作者（工作名称：dask-worker）似乎运行很好，它有大约 20Gb 可用，其中最多使用 5Gb。但是每当我尝试启动超过 50 个工作时，中央工作人员（工作名称：jupyterhub）运行大约 20 分钟后内存不足。

这是我分配计算的方式：

def complex_python_func(params):
    return compute(params=params).run()

然后我尝试使用 client.map 或延迟 :

list_of_params = [1, 2, 3, 4, 5, ... n] # with n > 256

# With delayed
lazy = [dask.delayed(complex_python_func)(l) for l in list_of_params]
futures = client.compute(lazy)
# Or with map
chain = client.map(complex_python_func, list_of_params)

这里是集群的配置：

cluster = PBSCluster(
    cores=1,
    memory="30GB",
    interface="ib0",
    queue=queue,
    processes=1,
    nanny=False,
    walltime="12:00:00",
    shebang="#!/bin/bash",
    env_extra=env_extra,
    python=python_bin,
)
cluster.scale(32)

我不明白为什么它不起作用。我希望 Dask 运行每次计算然后释放内存（每个任务大约每 6/7 分钟）。我用 qstat -f jobId 检查了 worker 的内存使用情况，它一直在增加，直到 worker 被杀死。

是什么导致 jupyterhub worker 失败以及实现此目的的好方法（或至少是更好的方法）是什么？

Answer 1

两个潜在的线索是：

如果不期望工人 return 任何事情，那么可能值得将 return 语句更改为 return None（尚不清楚 compute() 在你的脚本):

 def complex_python_func(params):
    return compute(params=params).run()

有可能 dask 为每个工人分配了不止一份工作，并且在某些时候，工人的任务多于它可以处理的。解决此问题的一种方法是使用 resources 减少工作人员在任何给定时间可以完成的任务数量，例如使用：

# add resources when creating the cluster
cluster = PBSCluster(
    # all other settings are unchanged, but add this line to give each worker
    extra=['--resources foo=1'],
)

# rest of code skipped, but make sure to specify resources needed by task
# when submitting it for computation
lazy = [dask.delayed(complex_python_func)(l) for l in list_of_params]
futures = client.compute(lazy, resources={'foo': 1})
# Or with map
chain = client.map(complex_python_func, list_of_params, resources={'foo': 1})

有关资源的更多信息，请参阅文档或此相关问题

在 HPC 集群上使用 dask 分配大量作业的策略

Strategy to distribute large number of jobs with dask on HPC cluster

python

hpc

dask

dask-delayed

dask-distributed