Nagios event handler script in bash to restart services: if one has not started, don't restart the next one until the condition is met

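Hi, Stack Overflow community,

I need help with a bash script, as I am new to this. What I want to accomplish: we have a Windows server that sometimes uses 90% of its memory, so whenever Nagios catches this, we want to restart these services via NRPE. The services must not all be restarted at once, though: the first service has to be up before the script moves on to restart the next one.

Another option is to stop all 4 services and then start them one after another.

Here is the script I wrote: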

case "$1" in
OK)
;;
WARNING)
;;
UNKNOWN)
;;
CRITICAL) ## DECISION ENGINE RESTART
echo -n "Restarting Decision Engine_1"
cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | mail -s "Restarting DE services" email@someteam.com -r Nagios@ATL-NM-01
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_1;
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_1 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_2"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_2
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_2 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_3"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_3
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_3 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_4"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_4
else
   echo " Restart is complete"
fi
;;
esac
exit 0

I'm not sure where I made the mistake; any feedback would be appreciated.

Thanks!

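All comments are in the code. Double-check the StopService function: you didn't mention how the services are stopped, so I wrote something similar to the restart call.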
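For context on why the question's script misbehaves: in bash, `> OK:` is an output redirection, so `if check_nrpe ... > OK:` writes the plugin output to a file named `OK:` and `if` then branches on the command's exit status alone. The nested `if` blocks are also missing their closing `fi`. The exit-status idiom can be sketched as below; `fake_check` and `describe` are hypothetical names, with `fake_check` standing in for the `check_nrpe` call so the example runs without a Nagios host.

```shell
#!/bin/bash
# fake_check is a hypothetical stand-in for check_nrpe: it prints a status
# line and returns the matching Nagios exit code (0 = OK, 2 = CRITICAL).
fake_check() {
    if [ "$1" = "up" ]; then
        echo "OK: service is running"
        return 0
    fi
    echo "CRITICAL: service is stopped"
    return 2
}

# describe branches only on the exit status of the check; the plugin's
# text output is discarded instead of being compared with "OK".
describe() {
    if fake_check "$1" >/dev/null; then
        echo "started"
    else
        echo "not started"
    fi
}

describe up
describe down
```

Because Nagios plugins report their result through the exit code, there is no need to parse the text that `check_nrpe` prints.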

#!/bin/bash

SERVICESTATE=$1;      #Common check state (OK, WARNING, CRITICAL or UNKNOWN)
Host=$2;              #Hostname or IP
SERVICESTATETYPE=$3;  #SOFT or HARD service state type

TimeOut=3;            #Number of wait cycles (one check plus a 1-second
                      #sleep each) to give a service to start/stop
                      #before processing the next service.
                      #The timeout cannot be infinite, because the
                      #Nagios process will kill this handler if it
                      #runs for too long.


#Services is an array of service names
Services=(DecisionEngine_1 DecisionEngine_2 DecisionEngine_3 DecisionEngine_4)

#add path to nagios plugins dir
PATH=$PATH:/usr/local/nagios/libexec

RestartService() {
   #Restarts a service via NRPE.
   #Usage:  RestartService ServiceName
   echo -n " Restarting $1;"
   check_nrpe -H "${Host}" -p 5666 -c restart_service -a "$1" >/dev/null 2>&1
   return $?
}

StopService() {
   #Stops a service via NRPE.
   #Usage: StopService ServiceName
   echo -n " Stopping $1;"
   check_nrpe -H "${Host}" -p 5666 -c stop_service -a "$1" >/dev/null 2>&1
   return $?
}

ServiceWait() {
   #Repeatedly checks a service via NRPE until the check succeeds (start),
   #the check fails (stop), or TimeOut is reached.
   #Usage:  ServiceWait ServiceName {start|stop}
   #start option waits for a successful check
   #stop option waits for a failed check
   #Returns 0 once the expected state is seen, 3 on timeout.
   Logic="";
   [ "$2" == "start" ] && Logic="-eq"; #RC for start check should be 0
   [ "$2" == "stop" ] && Logic="-ne" ; #RC for stop check should NOT be 0
   [ -z "$Logic" ] && { echo "ServiceWait function usage error"; exit 19; }
   t=${TimeOut}
   while [ "$t" -ge 0 ]; do
      check_nrpe -H "${Host}" -p 5666 -t 30 \
                 -c check_service -a "$1" 'crit=not state_is_ok()' >/dev/null 2>&1
      RC=$?
      #expected state reached, no need to wait anymore
      [ "$RC" $Logic 0 ] && { echo -n "CheckRC=$RC;"; return 0; }
      t=$((t - 1))
      sleep 1
   done
   echo -n "TimeOut; "
   return 3
}

#check that the script received non-empty params in $1, $2 and $3
if [ -z "${SERVICESTATE}" ] || [ -z "${Host}" ] || [ -z "${SERVICESTATETYPE}" ]; then
    echo "Usage: $0 {OK|WARNING|UNKNOWN|CRITICAL} Hostname {SOFT|HARD}"
    exit 1
fi

case "${SERVICESTATE}" in
   OK)
   ;;
   WARNING)
   ;;
   UNKNOWN)
   ;;
   CRITICAL) ## DECISION ENGINE RESTART
     #uncomment if you need the e-mail notification
     #cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | \
     # mail -s "Restarting DE services" email@someteam.com -r Nagios@ATL-NM-01
     RC=0
     SuccessRestart=()   #services that came back up after a restart

     if [ "$SERVICESTATETYPE" == "SOFT" ] ; then
        for (( i=0; i<${#Services[*]}; i++ )); do
           RestartService "${Services[$i]}"
           ServiceWait "${Services[$i]}" start
           RC=$?
           #if the check failed, do not attempt any further restarts
           [ "$RC" -ne 0 ] && break;
           SuccessRestart+=("${Services[$i]}")
        done
        echo "Restart is complete. ${SuccessRestart[*]} Return Code is ${RC}"
     elif [ "$SERVICESTATETYPE" == "HARD" ] ; then
        #Stop all services sequentially.
        for (( i=0; i<${#Services[*]}; i++ )); do
           StopService "${Services[$i]}"
           #You need to experiment with what to wait for here.
           #It may be better to simply sleep for N seconds while
           #the service is being stopped
           #rather than checking the service state.
           ServiceWait "${Services[$i]}" stop
           #sleep $TimeOut
        done
        #Start all services sequentially.
        for (( i=0; i<${#Services[*]}; i++ )); do
           RestartService "${Services[$i]}"
           ServiceWait "${Services[$i]}" start
           RC=$?
           #if the check failed, do not attempt any further restarts
           [ "$RC" -ne 0 ] && break;
           SuccessRestart+=("${Services[$i]}")
        done
        echo "Restart is complete. ${SuccessRestart[*]} Return Code is ${RC}"
     else
         echo "Unknown SERVICESTATETYPE $SERVICESTATETYPE option"
         exit 20
     fi
   ;;
esac
exit 0
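
To try the stop-at-first-failure logic without a Nagios host, the loop can be exercised against a stub. The sketch below is a simplification under stated assumptions, not the script above: `check_nrpe` is replaced by a hypothetical shell function that pretends `DecisionEngine_3` never comes back up, `restart_in_order` is an illustrative name, and `testhost` is a placeholder.

```shell
#!/bin/bash
# Hypothetical stub replacing the real check_nrpe plugin for a dry run.
# It reads the values after -c and -a and reports DecisionEngine_3 as
# CRITICAL on every check_service call; everything else returns OK.
check_nrpe() {
    local cmd="" svc=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -c) cmd="$2"; shift ;;
            -a) svc="$2"; shift ;;
        esac
        shift
    done
    if [ "$cmd" = "check_service" ] && [ "$svc" = "DecisionEngine_3" ]; then
        return 2    # Nagios CRITICAL
    fi
    return 0        # Nagios OK
}

# restart_in_order restarts each service and then checks it. It stops at
# the first service whose check fails and prints which ones came back.
restart_in_order() {
    local host="$1"; shift
    local ok=() svc
    for svc in "$@"; do
        check_nrpe -H "$host" -c restart_service -a "$svc" >/dev/null 2>&1
        if ! check_nrpe -H "$host" -c check_service -a "$svc" >/dev/null 2>&1; then
            echo "failed:$svc restarted:${ok[*]}"
            return 1
        fi
        ok+=("$svc")
    done
    echo "failed:none restarted:${ok[*]}"
}

restart_in_order testhost DecisionEngine_1 DecisionEngine_2 DecisionEngine_3 DecisionEngine_4 \
    || echo "stopped early"
```

With this stub, `DecisionEngine_4` is never touched, which is the behavior the question asks for. Swapping the stub for the real plugin only requires removing the `check_nrpe` function so the binary on `PATH` is used instead.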