Meaning of jStatus log values in LSF
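I am currently trying to decipher the contents of an lsb.events log file created by Platform Computing's "Platform Process Manager" version 8.1.
From various sources of documentation, I have found the following descriptions of the jStatus variable:
- 4 = running
- 32 = JOB_STAT_EXIT
- 64 = JOB_STAT_DONE
However, the JOB_STATUS entries also contain jStatus values of 2 and 192. What do these values represent?
Tagged SAS because this implementation is bundled with it. As an aside, I have observed that in some cases the actual fields in our lsb.events file do not match the fields that should be present according to the documentation mentioned above.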
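A status of 2 means the job is in the PSUSP state, which can be reached in several ways (for example, by submitting the job with the -H option so that it is held and not scheduled).
As for 192, the answer is that the job status is a bit field. In this case, two bits are set: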
- 64 = JOB_STAT_DONE
- 128 = JOB_STAT_PDONE
JOB_STAT_PDONE means the job had a post-execution script defined and that script completed successfully.
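To make the bit-field reading concrete, here is a minimal sketch in C. The two constants are copied from the values listed above; `jstatus_has` is a hypothetical helper name used only for illustration and is not part of the LSF API.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Values of the two bits discussed above. */
#define JOB_STAT_DONE  0x40  /* 64  */
#define JOB_STAT_PDONE 0x80  /* 128 */

/* 192 is simply both bits combined. */
static_assert((JOB_STAT_DONE | JOB_STAT_PDONE) == 192, "64 | 128 == 192");

/* Hypothetical helper: true when every bit of `flag` is set in `jstatus`. */
static bool jstatus_has(int jstatus, int flag)
{
    return (jstatus & flag) == flag;
}
```

With this helper, `jstatus_has(192, JOB_STAT_DONE)` and `jstatus_has(192, JOB_STAT_PDONE)` are both true, while `jstatus_has(64, JOB_STAT_PDONE)` is false.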
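The valid values for the job status bits are defined in the lsf/lsbatch.h file, which ships with LSF in its include directory: <LSF_INSTALL_DIR>/<LSF_VERSION>/include/lsf/lsbatch.h
To expand on this, and with thanks to @Squirrel, the relevant contents of our C:\LSF_7.0.0\include\lsf\lsbatch.h file are: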
/**
 * \addtogroup job_states job_states
 * define job states
 */
/*@{*/
#define JOB_STAT_NULL 0x00 /**< State null*/
#define JOB_STAT_PEND 0x01 /**< The job is pending, i.e., it
* has not been dispatched yet.*/
#define JOB_STAT_PSUSP 0x02 /**< The pending job was suspended by its
* owner or the LSF system administrator.*/
#define JOB_STAT_RUN 0x04 /**< The job is running.*/
#define JOB_STAT_SSUSP 0x08 /**< The running job was suspended
* by the system because an execution
* host was overloaded or the queue run
* window closed. (see \ref lsb_queueinfo,
* \ref lsb_hostinfo, and lsb.queues.)
*/
#define JOB_STAT_USUSP 0x10 /**< The running job was suspended by its
* owner or the LSF system administrator.*/
#define JOB_STAT_EXIT 0x20 /**< The job has terminated with a non-zero
* status - it may have been aborted due
* to an error in its execution, or
* killed by its owner or by the
* LSF system administrator.*/
#define JOB_STAT_DONE 0x40 /**< The job has terminated with status 0.*/
#define JOB_STAT_PDONE (0x80) /**< Post job process done successfully */
#define JOB_STAT_PERR (0x100) /**< Post job process has error */
#define JOB_STAT_WAIT (0x200) /**< Chunk job waiting its turn to exec */
#define JOB_STAT_RUNKWN 0x8000 /* Flag : Job status is UNKWN caused by
* losting contact with remote cluster */
#define JOB_STAT_UNKWN 0x10000 /**< The slave batch daemon (sbatchd) on
* the host on which the job is processed
* has lost contact with the master batch
* daemon (mbatchd).*/
Again, in decimal:
0 JOB_STAT_NULL
1 JOB_STAT_PEND
2 JOB_STAT_PSUSP
4 JOB_STAT_RUN
8 JOB_STAT_SSUSP
16 JOB_STAT_USUSP
32 JOB_STAT_EXIT
64 JOB_STAT_DONE
128 JOB_STAT_PDONE
256 JOB_STAT_PERR
512 JOB_STAT_WAIT
32768 JOB_STAT_RUNKWN
65536 JOB_STAT_UNKWN
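As a rough sketch of how the table above could be used when reading lsb.events, the following C snippet turns any jStatus value into a list of flag names. The mask values are copied from the header excerpt; the struct, the table name, and `jstatus_decode` are hypothetical names chosen for this example and are not part of the LSF API. Bits that do not appear in the table are silently ignored, which is an assumption that may not suit every LSF version.

```c
#include <stddef.h>
#include <stdio.h>

/* One row per flag; values copied from the lsbatch.h excerpt above. */
struct jstat_flag {
    unsigned    mask;
    const char *name;
};

static const struct jstat_flag JSTAT_FLAGS[] = {
    {0x00001, "JOB_STAT_PEND"},
    {0x00002, "JOB_STAT_PSUSP"},
    {0x00004, "JOB_STAT_RUN"},
    {0x00008, "JOB_STAT_SSUSP"},
    {0x00010, "JOB_STAT_USUSP"},
    {0x00020, "JOB_STAT_EXIT"},
    {0x00040, "JOB_STAT_DONE"},
    {0x00080, "JOB_STAT_PDONE"},
    {0x00100, "JOB_STAT_PERR"},
    {0x00200, "JOB_STAT_WAIT"},
    {0x08000, "JOB_STAT_RUNKWN"},
    {0x10000, "JOB_STAT_UNKWN"},
};

/* Hypothetical helper: writes the names of all flags set in jstatus into
 * buf, separated by '|', and returns buf. Stops appending if buf is too
 * small. A value of 0 is reported as JOB_STAT_NULL. */
static char *jstatus_decode(unsigned jstatus, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';

    if (jstatus == 0) {
        snprintf(buf, len, "JOB_STAT_NULL");
        return buf;
    }

    for (size_t i = 0; i < sizeof JSTAT_FLAGS / sizeof JSTAT_FLAGS[0]; i++) {
        if ((jstatus & JSTAT_FLAGS[i].mask) == 0)
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", JSTAT_FLAGS[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* output truncated; stop appending */
        used += (size_t)n;
    }
    return buf;
}
```

Given a 64-byte buffer, `jstatus_decode(192, buf, sizeof buf)` produces `JOB_STAT_DONE|JOB_STAT_PDONE`, and `jstatus_decode(2, buf, sizeof buf)` produces `JOB_STAT_PSUSP`, matching the two values from the question.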