SLURM: how to submit a job when another job has finished?
I am currently using a Linux-based HPC that accepts job submissions only through SLURM, and it allows each job to run for at most 12 hours. However, I may need to run some 24 jobs back to back, about a week of continuous computation, to get a good result.
Is there a way to (automatically) run the job again once it finishes?
Kind regards
Added:
When a job finishes, a .out file is created; in other words, the number of .out files increases by one.
Is it possible to requeue the job once the number of .out files has increased?
#!/bin/bash
#!
#! Example SLURM job script for Darwin (Sandy Bridge, ConnectX3)
#! Last updated: Sat Apr 18 13:05:53 BST 2015
#!
#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################
#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J Validation
#! Which project should be charged:
#SBATCH -A SOGA
#! How many whole nodes should be allocated?
#SBATCH --nodes=1
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=1
#!SBATCH --mem=200
#! How much wallclock time will be required?
#SBATCH --time=12:00:00
#SBATCH --mail-user=zl352
#SBATCH --mail-type=ALL
#! Uncomment this to prevent the job from being requeued (e.g. if
#! interrupted by node failure or system downtime):
##SBATCH --no-requeue
#! Do not change:
#SBATCH -p sandybridge
#! sbatch directives end here (put any additional directives above this line)
#! Notes:
#! Charging is determined by core number*walltime.
#! The --ntasks value refers to the number of tasks to be launched by SLURM only. This
#! usually equates to the number of MPI tasks launched. Reduce this from nodes*16 if
#! demanded by memory requirements, or if OMP_NUM_THREADS>1.
#! Each task is allocated 1 core by default, and each core is allocated 3994MB. If this
#! is insufficient, also specify --cpus-per-task and/or --mem (the latter specifies
#! MB per node).
#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e 's/^\([0-9][0-9]*\).*$/\1/')
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:
#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh # Leave this line (enables the module command)
module purge # Removes all modules still loaded
module load default-impi # REQUIRED - loads the basic environment
#! Insert additional module load commands after this line if needed:
#! Full path to application executable:
application="~/scratch/code7/viv"
#! Run options for the application:
options=" > test.e"
#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR" # The value of SLURM_SUBMIT_DIR sets workdir to the directory
# in which sbatch is run.
#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so increase this
#! safe value to no more than 16:
export OMP_NUM_THREADS=1
#! Number of MPI tasks to be started by the application per node and in total (do not change):
np=$[${numnodes}*${mpi_tasks_per_node}]
#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.
#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):
#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI.
#!CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options"
#! Choose this for a pure shared-memory OpenMP parallel program on a single node:
#! (OMP_NUM_THREADS threads will be created):
CMD="$application $options"
#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI:
#!CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options"
###############################################################
### You should not have to change anything below this line ####
###############################################################
cd $workdir
echo -e "Changed directory to `pwd`.\n"
JOBID=$SLURM_JOB_ID
echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
if [ "$SLURM_JOB_NODELIST" ]; then
#! Create a machine file:
export NODEFILE=`generate_pbs_nodefile`
cat $NODEFILE | uniq > machine.file.$JOBID
echo -e "\nNodes allocated:\n================"
echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi
echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"
echo -e "\nExecuting command:\n==================\n$CMD\n"
eval $CMD
If your job is intrinsically restartable, all you need to do is call sbatch at the end of your submission script. Assuming it is called submit.sh:
if ! job_is_done; then
    sbatch submit.sh
fi
The job_is_done part should be replaced with a command that returns 0 when the job is done (i.e. the computation has finished, the process has converged, etc.), for instance by 'grepping' the log file for a specific clue.
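For example, a hedged sketch, assuming (hypothetically) that your application prints a line containing CONVERGED to test.e when it is finished:
# Hypothetical completion test: adapt the pattern and file name to
# whatever clue your application actually writes to its log.
if ! grep -q "CONVERGED" test.e; then
    sbatch submit.sh
fi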
You can also requeue the job instead:
job_is_done || scontrol requeue $SLURM_JOB_ID
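Applied to the .out-file idea from the question: note that with resubmission via sbatch each run gets a new job ID, and therefore a new slurm-&lt;jobid&gt;.out file, so the file count can serve as the completion test (a requeued job keeps its ID and reuses the same output file). A hedged sketch, appended to the end of submit.sh, where the target of 14 runs (one week at 12 hours each) is an assumption:
# Count the SLURM output files produced so far (one per completed run).
n_out=$(ls -1 slurm-*.out 2>/dev/null | wc -l)
# Resubmit until the assumed target of 14 runs (a week at 12 h each).
if [ "$n_out" -lt 14 ]; then
    sbatch submit.sh
fi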
If your program is not intrinsically restartable, you can use a wrapper such as DMTCP to make it restartable.
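A minimal sketch with DMTCP, assuming its binaries are available on the cluster (the checkpoint interval of 3600 s is an arbitrary choice):
# First run: start the program under the DMTCP checkpointer,
# writing a checkpoint image every hour.
dmtcp_launch --interval 3600 ~/scratch/code7/viv > test.e
# Later runs: restart from the newest checkpoint via the restart
# script that DMTCP generates next to the checkpoint images:
# ./dmtcp_restart_script.sh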