PBS jobs inter-dependency: one job starts, cancel others
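I would like to submit a simulation to several queues on a cluster. As soon as one queue starts it, it should be cancelled on the other queues. I realize this may be ill-defined, since several jobs could start on several queues at the same time.
A bash script that monitors the queues could most likely do this. Can it be done directly with qsub when submitting the jobs?
EDIT: below is a working example using bash scripts. It is probably not optimal, since it relies on (slow) disk access.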
#!/bin/bash -
#
# Exit in case of error
set -e
#
# Command-line argument is the name of the shared file
fid="$1"
# Make sure the directory holding the shared files exists
mkdir -p "${HOME}/.dep_jobs"
if [ -f "${HOME}/.dep_jobs/${fid}" ]; then
    echo "Given name already used, abort."
    exit 1
else
    echo "Initialize case."
    touch "${HOME}/.dep_jobs/${fid}"
fi
#
# Submit master job and retrieve the ID
echo "Submitting master job"
MID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue1 run.pbs)
echo ${MID##* }
#
# Add the ID to the shared file
ln -s ${HOME}/.dep_jobs/${fid} ${HOME}/.dep_jobs/${MID##* }
echo "M ${MID##* }" >> ${HOME}/.dep_jobs/${fid}
#
# Submit slave job and retrieve the ID
echo "Submitting slave job"
SID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue2 run.pbs)
echo ${SID##* }
#
# Add the ID to the shared file
ln -s ${HOME}/.dep_jobs/${fid} ${HOME}/.dep_jobs/${SID##* }
echo "S ${SID##* }" >> ${HOME}/.dep_jobs/${fid}
#
# Terminus, finalize case
echo "Finalize case"
echo "OK" >> ${HOME}/.dep_jobs/${fid}
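The `${MID##* }` expansion used above removes everything up to and including the last space, so it keeps only the final word of whatever qsub prints. Here is a small sketch with made-up strings; the actual qsub output format depends on the scheduler, and both sample values below are hypothetical:

```shell
#!/bin/bash
# Hypothetical qsub outputs: one with leading words, one that is a bare job ID.
with_prefix="Your job 1234.server"
bare_id="1234.server"

# "##* " strips the longest prefix ending in a space, leaving the last word.
id_from_prefix="${with_prefix##* }"
# With no space in the string, nothing matches and the value is unchanged.
id_from_bare="${bare_id##* }"

echo "${id_from_prefix}"
echo "${id_from_bare}"
```

Either way the job ID ends up alone, which is what the symlink name and the shared-file entries rely on.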
The submitted PBS script should begin as follows:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N Parallel
#
# Define shared file
shared_file=${HOME}/.dep_jobs/${PBS_JOBID}
#
# Read it until it finishes with "OK"
while [[ "$(tail -n1 "${shared_file}")" != "OK" ]]; do
    sleep 1
done
#
# Read master and slave job id
while read -r line
do
    # First field is the key (M or S), second field is the job ID
    key=$(echo "${line}" | awk '{print $1}')
    if [ "$key" = "M" ]; then
        MID=$(echo "${line}" | awk '{print $2}')
    elif [ "$key" = "S" ]; then
        SID=$(echo "${line}" | awk '{print $2}')
    fi
done < "${shared_file}"
#
# Current job is master or slave?
if [ "${PBS_JOBID}" = "${MID}" ]; then
    key="M"
    other="${SID}"
else
    key="S"
    other="${MID}"
fi
#
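# Check the status of the other job
# (field 5 is assumed to be the state column "S" of the default qstat listing;
# adjust the field number if your qstat output is laid out differently)
status="$(qstat "${other}" | tail -n1 | awk '{print $5}')"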
#
# I am running, if the other is in queue, qdel it
if [ "${status}" = "Q" ]; then
    qdel "${other}"
# If the other is running, we have a race and only the master survives
elif [ "${status}" = "R" ]; then
    if [ "${key}" = "M" ]; then
        qdel "${other}"
    else
        exit
    fi
else
    echo "We should not be here"
    exit
fi
#
# The simulation goes here
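The loop that reads the master and slave IDs from the shared file can also be written without awk, by letting `read` split each line into fields. This is only a minimal sketch run against a temporary file; the job IDs and the temporary file are illustrative and do not come from a real cluster:

```shell
#!/bin/bash
# Build a sample shared file in the format written by the submission script.
shared_file="$(mktemp)"
printf '%s\n' "M 1001.server" "S 1002.server" "OK" > "${shared_file}"

MID=""
SID=""
# read splits each line on whitespace: first word -> key, remainder -> id
while read -r key id; do
    case "${key}" in
        M) MID="${id}" ;;
        S) SID="${id}" ;;
    esac
done < "${shared_file}"

echo "master=${MID} slave=${SID}"
rm -f "${shared_file}"
```

The final "OK" line matches neither case and is simply skipped.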
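The script below is written for the SGE scheduler. For the PBS scheduler you would need a few minimal changes, such as using #PBS instead of #$ and changing $JOB_ID to $PBS_JOBID.
Also, for the SGE scheduler a better approach would be to run
qstat -u user_name -s p
which lists only pending jobs, but I could not find a similar option for the PBS scheduler. Assuming it does not exist, one approach is to use the following script for your simulation jobs (no master script is needed):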
#!/bin/bash
#$-N myjobName
#$-q queueName
#... some other options if needed
# get the list of all jobs owned by the user (queued and running)
myjobs="$(qstat -u username | cut -d " " -f1 | tail -n +3| tr '\n' ' ' )"
# from the above list remove the current job (use PBS_JOBID for PBS scheduler)
deljobs="$(echo "${myjobs/$JOB_ID/}")"
echo "List of all jobs: $myjobs"
echo "List of jobs to delete: $deljobs"
#delete all other jobs
qdel $deljobs
#run the desired commands/programs
date
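You need to replace username in the qstat command of the script above with your own user name.
I also recommend trying these commands one at a time to make sure they behave correctly in your environment.
Here is a brief explanation of the commands used in the script: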
qstat -u username # list all jobs owned by username
cut -d " " -f1 # extract the JOBID of each job from the previous output (first column)
tail -n +3 # skip the first 2 lines of the above output
tr '\n' ' ' # replace newline characters with spaces
echo "${myjobs/$JOB_ID/}" # remove $JOB_ID from the string stored in the $myjobs variable
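To see what the pipeline does without touching a real queue, the sketch below runs the same steps on made-up qstat-style text. The sample listing, the job IDs, and the JOB_ID value are all hypothetical. Real qstat output differs between schedulers and versions (for example in the number of header lines or in leading whitespace), so check the format on your system first.

```shell
#!/bin/bash
# Hypothetical qstat-style output: two header lines, then one job per line.
sample_output='job-ID  name  user  state
-----------------------------
101 sim alice qw
102 sim alice r
103 sim alice qw'

# Pretend the current job is 102 (SGE sets JOB_ID inside a running job).
JOB_ID=102

# Same pipeline as in the script, fed from the sample text instead of qstat.
myjobs="$(printf '%s\n' "$sample_output" | cut -d " " -f1 | tail -n +3 | tr '\n' ' ')"

# Remove the current job's ID from the list.
deljobs="${myjobs/$JOB_ID/}"

echo "List of all jobs: $myjobs"
echo "List of jobs to delete: $deljobs"
```

Note that the substitution removes the first matching substring, not a whole word, so an ID that happens to be part of another ID could be clipped incorrectly.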