Custom job script submission to PBS via Dask?
I have a PBS job script with an executable that writes its results to an out file.
### some lines
PBS_O_EXEDIR="path/to/software"
EXECUTABLE="executablefile"
OUTFILE="out"
### Copy application directory on compute node
[ -d $PBS_O_EXEDIR ] || mkdir -p $PBS_O_EXEDIR
[ -w $PBS_O_EXEDIR ] && \
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_EXEDIR `dirname $PBS_O_EXEDIR`
[ -d $PBS_O_WORKDIR ] || mkdir -p $PBS_O_WORKDIR
rsync -Cavz --rsh=$SSH $HOST:$PBS_O_WORKDIR `dirname $PBS_O_WORKDIR`
# Change into the working directory
cd $PBS_O_WORKDIR
# Save the jobid in the outfile
echo "PBS-JOB-ID was $PBS_JOBID" > $OUTFILE
# Run the executable
$PBS_O_EXEDIR/$EXECUTABLE >> $OUTFILE
In my project I have to use Dask to submit the jobs and monitor them, so I configured the jobqueue.yaml file like this.
jobqueue:
  pbs:
    name: htc_calc

    # Dask worker options
    cores: 4        # Total number of cores per job
    memory: 50GB    # Total amount of memory per job

    # PBS resource manager options
    shebang: "#!/usr/bin/env bash"
    walltime: '00:30:00'
    exe_dir: "/home/r/rb11/softwares/FPLO/bin"
    excutable: "fplo18.00-57-x86_64"
    outfile: "out"
    job-extra: "exe_dir/executable >> outfile"
However, I got the following error when submitting the job via Dask.
qsub: directive error: e
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3d8c4a56a8>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py:284> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nqsub /tmp/tmpwyvkfcmi.sh\nstdout:\n\nstderr:\nqsub: directive error: e \n\n',)>)
Traceback (most recent call last):
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
future.result()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/futures.py", line 294, in result
raise self._exception
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/asyncio/tasks.py", line 240, in _step
result = coro.send(None)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 317, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/distributed/deploy/spec.py", line 41, in _
await self.start()
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 285, in start
out = await self._submit_job(fn)
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 268, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/r/rb11/anaconda3/envs/htc/lib/python3.5/site-packages/dask_jobqueue/core.py", line 368, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
qsub /tmp/tmpwyvkfcmi.sh
stdout:
stderr:
qsub: directive error: e
How do I specify a custom bash script in Dask?
Dask is used for distributing Python applications. In the case of Dask Jobqueue it works by submitting a scheduler and workers to the batch system, which connect together to form their own cluster. You can then submit Python work to the Dask scheduler.
From your example it looks like you are trying to use the cluster setup configuration to run your own bash application instead of Dask.
In order to do this with Dask you should return the jobqueue configuration to the defaults and instead write a Python function which calls your bash script.
import os

from dask_jobqueue import PBSCluster
cluster = PBSCluster()
cluster.scale(jobs=10)  # Deploy ten single-node jobs

from dask.distributed import Client
client = Client(cluster)  # Connect this local process to remote workers

client.submit(os.system, "/path/to/your/script")  # Run the script on one of the workers
However, Dask seems like a poor fit for what you are trying to do. You would probably be better off just submitting your jobs to PBS normally.