Create local_directory for dask_jobqueue
I'm trying to run Dask on an HPC system that uses NFS for storage, so I want to configure Dask to use node-local storage for temporary space. Each cluster node has a /scratch/ folder that all users can write to, with instructions to put scratch files in /scratch/<username>/<jobid>/.
I have some code set up like this:
import dask_jobqueue
from distributed import Client

cluster = dask_jobqueue.SLURMCluster(
    queue='high',
    cores=24,
    memory='60GB',
    walltime='10:00:00',
    local_directory='/scratch/<username>/<jobid>/'
)
cluster.scale(1)
client = Client(cluster)
However, I have a problem: the directory doesn't exist ahead of time (both because I don't know which node the workers will run on and because it's built from the SLURM job ID, which is always unique), so my code fails:
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/home/lsterzin/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/lsterzin/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 699, in _run
    worker = Worker(**worker_kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 497, in __init__
    self._workspace = WorkSpace(os.path.abspath(local_directory))
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/diskutils.py", line 118, in __init__
    self._init_workspace()
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/diskutils.py", line 124, in _init_workspace
    os.mkdir(self.base_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/<user>/<jobid>'
I can't create the directory without knowing ahead of time which node the dask workers will run on, and I can't create the cluster with dask_jobqueue without the directory already existing. What's the best way to solve this?
I think this can be done with /scratch/$USER/$SLURM_JOB_ID. If that doesn't work, the local-directory can perhaps be defined through a configuration file:
https://jobqueue.dask.org/en/latest/configuration-setup.html#local-storage
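A minimal sketch of that environment-variable approach (assuming the variables are expanded by the shell on the compute node when the generated batch script runs; the queue/cores/memory values are just copied from the question):

import dask_jobqueue

# $USER and $SLURM_JOB_ID are written into the generated batch script
# verbatim and expanded by the shell on the compute node, so each worker
# job ends up with its own per-job scratch directory.
cluster = dask_jobqueue.SLURMCluster(
    queue='high',
    cores=24,
    memory='60GB',
    walltime='10:00:00',
    local_directory='/scratch/$USER/$SLURM_JOB_ID'
)

# Inspect the batch script to confirm the variables appear unexpanded;
# they are resolved at job runtime, not at submit time.
print(cluster.job_script())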
The example configurations may also be useful to you:
https://jobqueue.dask.org/en/latest/configurations.html
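For the configuration-file route, a sketch of what the jobqueue config file (e.g. ~/.config/dask/jobqueue.yaml) might contain; the exact path and scratch layout are site-specific assumptions:

jobqueue:
  slurm:
    local-directory: "/scratch/$USER/$SLURM_JOB_ID"  # assumed site scratch layout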
Thanks @lsterzinger for the well-worded question. I pushed a fix that may help here: https://github.com/dask/distributed/pull/3928
Let's see what the community says.