Meaning of jStatus log values in LSF

I am currently trying to decipher the contents of the lsb.events log file, which is created by Platform Computing's "Platform Process Manager" version 8.1.

In various documentation sources, I see the following description of the jStatus variable:

However, the JOB_STATUS entries also contain jStatus values of 2 and 192. What do these values represent?

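Tagged SAS because this implementation is bundled with it. Incidentally, I have observed that in some cases the actual fields in our lsb.events file do not match the fields that should appear according to the documentation above.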

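Status 2 means the job is in the PSUSP state, which can be reached in several ways (for example, by submitting the job with the -H option so that it is held and not scheduled).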

For 192, the answer is that the job status is a bit field. In this case, 2 bits are set:

  • 64 = JOB_STAT_DONE
  • 128 = JOB_STAT_PDONE

JOB_STAT_PDONE means that the job has a post-execution script defined and that the script completed successfully.

The valid values for the job status bits are listed in the lsf/lsbatch.h file, which ships with LSF in the include directory: <LSF_INSTALL_DIR>/<LSF_VERSION>/include/lsf/lsbatch.h

To expand on this, with thanks to @Squirrel, the relevant contents of our C:\LSF_7.0.0\include\lsf\lsbatch.h file are:

/**
 * \addtogroup job_states job_states
 * define job states
 */
/*@{*/
#define JOB_STAT_NULL         0x00       /**< State null*/
#define JOB_STAT_PEND         0x01       /**< The job is pending, i.e., it 
                                            * has not been dispatched yet.*/
#define JOB_STAT_PSUSP        0x02       /**< The pending job was suspended by its
                                            * owner or the LSF system administrator.*/
#define JOB_STAT_RUN          0x04       /**< The job is running.*/
#define JOB_STAT_SSUSP        0x08       /**< The running job was suspended 
                                           * by the system because an execution 
                                           * host was overloaded or the queue run 
                                           * window closed. (see \ref lsb_queueinfo, 
                                           * \ref lsb_hostinfo, and lsb.queues.)
                                           */
#define JOB_STAT_USUSP        0x10       /**< The running job was suspended by its 
                                           * owner or the LSF system administrator.*/
#define JOB_STAT_EXIT         0x20       /**< The job has terminated with a non-zero
                                           * status - it may have been aborted due 
                                           * to an error in its execution, or 
                                           * killed by its owner or by the 
                                           * LSF system administrator.*/
#define JOB_STAT_DONE         0x40       /**< The job has terminated with status 0.*/
#define JOB_STAT_PDONE        (0x80)     /**< Post job process done successfully */
#define JOB_STAT_PERR         (0x100)    /**< Post job process has error */
#define JOB_STAT_WAIT         (0x200)    /**< Chunk job waiting its turn to exec */
#define JOB_STAT_RUNKWN       0x8000     /* Flag : Job status is UNKWN caused by 
                                          * losting contact with remote cluster */
#define JOB_STAT_UNKWN        0x10000    /**< The slave batch daemon (sbatchd) on 
                                          * the host on which the job is processed 
                                          * has lost contact with the master batch 
                                          * daemon (mbatchd).*/

Again, in decimal:

0       JOB_STAT_NULL
1       JOB_STAT_PEND
2       JOB_STAT_PSUSP
4       JOB_STAT_RUN
8       JOB_STAT_SSUSP 
16      JOB_STAT_USUSP 
32      JOB_STAT_EXIT 
64      JOB_STAT_DONE
128     JOB_STAT_PDONE 
256     JOB_STAT_PERR 
512     JOB_STAT_WAIT
32768   JOB_STAT_RUNKWN 
65536   JOB_STAT_UNKWN