带有 nrpe 的自定义 nagios 脚本导致非零退出状态 1
custom nagios script with nrpe resulting in non-zero exit status 1
我正在尝试 运行 一个 python 脚本使用 nrpe 来监控 rabbitmq。脚本内部是一个命令 'sudo rabbiqmqctl list_queues',它为我提供了每个队列的消息计数。然而,这导致 nagios 给出 htis 消息:
CRITICAL - Command '['sudo', 'rabbitmqctl', 'list_queues']' returned non-zero exit status 1
我认为这可能是权限问题,所以按以下方式进行
/etc/group:
ec2-user:x:500:
rabbitmq:x:498:nrpe,nagios,ec2-user
nagios:x:497:
nrpe:x:496:
rpc:x:32:
/etc/sudoers:
%rabbitmq ALL=NOPASSWD: /usr/sbin/rabbitmqctl
nagios 配置:
command[check_rabbitmq_queuecount_prod]=/usr/bin/python27 /etc/nagios/check_rabbitmq_prod -a queues_count -C 3000 -W 1500
check_rabbitmq_prod:
#!/usr/bin/env python
from optparse import OptionParser
import shlex
import subprocess
import sys
class RabbitCmdWrapper(object):
"""So basically this just runs rabbitmqctl commands and returns parsed output.
Typically this means you need root privs for this to work.
Made this it's own class so it could be used in other monitoring tools
if desired."""
@classmethod
def list_queues(cls):
args = shlex.split('sudo rabbitmqctl list_queues')
cmd_result = subprocess.check_output(args).strip()
results = cls._parse_list_results(cmd_result)
return results
@classmethod
def _parse_list_results(cls, result_string):
results = result_string.strip().split('\n')
#remove text fluff
results.remove(results[-1])
results.remove(results[0])
return_data = []
for row in results:
return_data.append(row.split('\t'))
return return_data
def check_queues_count(critical=1000, warning=1000):
"""
A blanket check to make sure all queues are within count parameters.
TODO: Possibly break this out so test can be done on individual queues.
"""
try:
critical_q = []
warning_q = []
ok_q = []
results = RabbitCmdWrapper.list_queues()
for queue in results:
if queue[0] == 'SFS_Production_Queue':
count = int(queue[1])
if count >= critical:
critical_q.append("%s: %s" % (queue[0], count))
elif count >= warning:
warning_q.append("%s: %s" % (queue[0], count))
else:
ok_q.append("%s: %s" % (queue[0], count))
if critical_q:
print "CRITICAL - %s" % ", ".join(critical_q)
sys.exit(2)
elif warning_q:
print "WARNING - %s" % ", ".join(warning_q)
sys.exit(1)
else:
print "OK - %s" % ", ".join(ok_q)
sys.exit(0)
except Exception, err:
print "CRITICAL - %s" % err
sys.exit(2)
USAGE = """Usage: ./check_rabbitmq -a [action] -C [critical] -W [warning]
Actions:
- queues_count
checks the count in each of the queues in rabbitmq's list_queues"""
if __name__ == "__main__":
parser = OptionParser(USAGE)
parser.add_option("-a", "--action", dest="action",
help="Action to Check")
parser.add_option("-C", "--critical", dest="critical",
type="int", help="Critical Threshold")
parser.add_option("-W", "--warning", dest="warning",
type="int", help="Warning Threshold")
(options, args) = parser.parse_args()
if options.action == "queues_count":
check_queues_count(options.critical, options.warning)
else:
print "Invalid action: %s" % options.action
print USAGE
此时我不确定是什么阻止脚本 运行ning。通过命令行 运行 没问题。感谢任何帮助。
"non-zero exit code" 错误通常与在您的 sudoers 文件中默认应用于所有用户的 requiretty
有关。
在运行检查的用户的 sudoers 文件中禁用 "requiretty" 是安全的,并且可能会解决问题。
例如(假设 nagios/nrpe 是用户)
@/etc/sudoers
Defaults:nagios !requiretty
Defaults:nrpe !requiretty
我猜@EE1213先生说的是对的。如果您有权查看 /var/log/secure,日志可能包含有关 sudoers 的错误消息。喜欢:
"sorry, you must have a tty to run sudo"
我正在尝试 运行 一个 python 脚本使用 nrpe 来监控 rabbitmq。脚本内部是一个命令 'sudo rabbiqmqctl list_queues',它为我提供了每个队列的消息计数。然而,这导致 nagios 给出 htis 消息:
CRITICAL - Command '['sudo', 'rabbitmqctl', 'list_queues']' returned non-zero exit status 1
我认为这可能是权限问题,所以按以下方式进行
/etc/group:
ec2-user:x:500:
rabbitmq:x:498:nrpe,nagios,ec2-user
nagios:x:497:
nrpe:x:496:
rpc:x:32:
/etc/sudoers:
%rabbitmq ALL=NOPASSWD: /usr/sbin/rabbitmqctl
nagios 配置:
command[check_rabbitmq_queuecount_prod]=/usr/bin/python27 /etc/nagios/check_rabbitmq_prod -a queues_count -C 3000 -W 1500
check_rabbitmq_prod:
#!/usr/bin/env python
from optparse import OptionParser
import shlex
import subprocess
import sys
class RabbitCmdWrapper(object):
"""So basically this just runs rabbitmqctl commands and returns parsed output.
Typically this means you need root privs for this to work.
Made this it's own class so it could be used in other monitoring tools
if desired."""
@classmethod
def list_queues(cls):
args = shlex.split('sudo rabbitmqctl list_queues')
cmd_result = subprocess.check_output(args).strip()
results = cls._parse_list_results(cmd_result)
return results
@classmethod
def _parse_list_results(cls, result_string):
results = result_string.strip().split('\n')
#remove text fluff
results.remove(results[-1])
results.remove(results[0])
return_data = []
for row in results:
return_data.append(row.split('\t'))
return return_data
def check_queues_count(critical=1000, warning=1000):
"""
A blanket check to make sure all queues are within count parameters.
TODO: Possibly break this out so test can be done on individual queues.
"""
try:
critical_q = []
warning_q = []
ok_q = []
results = RabbitCmdWrapper.list_queues()
for queue in results:
if queue[0] == 'SFS_Production_Queue':
count = int(queue[1])
if count >= critical:
critical_q.append("%s: %s" % (queue[0], count))
elif count >= warning:
warning_q.append("%s: %s" % (queue[0], count))
else:
ok_q.append("%s: %s" % (queue[0], count))
if critical_q:
print "CRITICAL - %s" % ", ".join(critical_q)
sys.exit(2)
elif warning_q:
print "WARNING - %s" % ", ".join(warning_q)
sys.exit(1)
else:
print "OK - %s" % ", ".join(ok_q)
sys.exit(0)
except Exception, err:
print "CRITICAL - %s" % err
sys.exit(2)
USAGE = """Usage: ./check_rabbitmq -a [action] -C [critical] -W [warning]
Actions:
- queues_count
checks the count in each of the queues in rabbitmq's list_queues"""
if __name__ == "__main__":
parser = OptionParser(USAGE)
parser.add_option("-a", "--action", dest="action",
help="Action to Check")
parser.add_option("-C", "--critical", dest="critical",
type="int", help="Critical Threshold")
parser.add_option("-W", "--warning", dest="warning",
type="int", help="Warning Threshold")
(options, args) = parser.parse_args()
if options.action == "queues_count":
check_queues_count(options.critical, options.warning)
else:
print "Invalid action: %s" % options.action
print USAGE
此时我不确定是什么阻止脚本 运行ning。通过命令行 运行 没问题。感谢任何帮助。
"non-zero exit code" 错误通常与在您的 sudoers 文件中默认应用于所有用户的 requiretty
有关。
在运行检查的用户的 sudoers 文件中禁用 "requiretty" 是安全的,并且可能会解决问题。
例如(假设 nagios/nrpe 是用户)
@/etc/sudoers
Defaults:nagios !requiretty
Defaults:nrpe !requiretty
我猜@EE1213先生说的是对的。如果您有权查看 /var/log/secure,日志可能包含有关 sudoers 的错误消息。喜欢:
"sorry, you must have a tty to run sudo"