Torque job randomly dying
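I am running a Python 3 script in batches of about 27 runs, each run with a different input. The results are then saved to the results/$sizex$size folder. The working directory also has to be changed to that folder so that the program can save some images and data there.
This is my pbs script: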
#!/bin/bash
#PBS -l nodes=1:ppn=28
#PBS -l mem=16gb
#PBS -l walltime=120:00:00
cd $PBS_O_WORKDIR
mkdir -p results
module purge
module load newmodules/1.0-Lmod GCC/6.3.0-2.27 OpenMPI/2.0.2
module load Python/3.6.1
j=0
for i in $(seq 2 2 1024); do
if [ "$j" -gt "28" ]; then
wait;
j=0;
fi
cd results
mkdir -p $i"x"$i
cd $i"x"$i
time python3 $PBS_O_WORKDIR/model.py $i > result.txt &
cd $PBS_O_WORKDIR
((j++))
((j++))
done
wait
These are the logs I get from running tracejob:
kill_task: not killing process (pid=142457/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142458 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142460/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142461 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142463/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142464 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142466/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142467 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142469/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142470 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142472/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142473 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142475/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142476 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142478/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142479 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142481/state=Z) with sig 15
02/27/2018 21:42:26.758 M kill_task: killing pid 142482 task 1 with sig 15
02/27/2018 21:42:26.758 M kill_task: not killing process (pid=142483/state=Z) with sig 15
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142442/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142445/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142448/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142451/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142454/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142457/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142460/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142463/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142466/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142469/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142472/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142475/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142478/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142481/state=Z) with sig 9
02/27/2018 21:42:26.788 M kill_task: not killing process (pid=142483/state=Z) with sig 9
02/27/2018 21:42:26.788 M scan_for_terminated: job 2205476.example.com task 1 terminated, sid=121918
02/27/2018 21:42:26.788 M job was terminated
02/27/2018 21:42:26.818 M obit sent to server
02/27/2018 21:42:26.882 M removed job scrip
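The job runs for an hour or so and then randomly dies. I am not sure why. I tried increasing the walltime, but that made no difference.
Basically, my Python script takes each multiple of 2 between 2 and 1024 as input, and the runs execute in parallel (in batches of 27, so that the server does not crash or start swapping). Can anyone suggest why this is happening?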
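So I solved this by using GNU parallel. On some servers you may need to load it as a module first, like this: module load gnu-parallel
Then, in the pbs script, I simply removed the for loop and ran this instead: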
parallel -j28 'python3 model.py {1} > results{1}.txt' ::: $(seq 100 -2 2)
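Where GNU parallel is not installed, roughly the same pattern (at most 28 runs at a time, one output file per input) can be approximated with the Python standard library. This is only an illustrative sketch of an alternative, not the method used above; run_one, run_all and the base_cmd parameter are names invented for this example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(size, base_cmd):
    """Run base_cmd with size appended; write its stdout to results<size>.txt."""
    out_name = f"results{size}.txt"
    with open(out_name, "w") as out:
        # One output file per input, so runs cannot overwrite each other.
        proc = subprocess.run(base_cmd + [str(size)], stdout=out)
    return size, proc.returncode


def run_all(sizes, base_cmd, max_workers=28):
    """Run base_cmd once per size, with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each thread only waits on a child process, so threads are sufficient.
        return list(pool.map(lambda s: run_one(s, base_cmd), sizes))
```

A call such as run_all(range(100, 0, -2), [sys.executable, "model.py"]) would cover the same inputs as seq 100 -2 2, assuming model.py is in the current directory. The returned list of (size, return code) pairs makes it easy to see which runs failed.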
I also had to change the working directory inside the program so that the results would not be overwritten.
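A minimal sketch of what that directory change might look like inside model.py, assuming the results/<size>x<size> layout described in the question. The helper name enter_result_dir and the base parameter are hypothetical, not taken from the actual program.

```python
import os


def enter_result_dir(size, base="results"):
    """Create base/<size>x<size> and make it the current working directory."""
    target = os.path.join(base, f"{size}x{size}")
    os.makedirs(target, exist_ok=True)  # no error if the directory already exists
    os.chdir(target)  # files saved with relative paths now land in target
    return os.getcwd()
```

Calling enter_result_dir(int(sys.argv[1])) once near the top of the program (with sys imported), before any images or data are written, would give every input size its own folder.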