Supervisor fails to restart half of the time
I'm trying to deploy a Django app using uWSGI and Supervisor on a machine running Debian 8.1.
When I restart it via sudo systemctl restart supervisor, it fails to restart about half of the time.
$ root@host:/# systemctl start supervisor
Job for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
$ root@host:/# systemctl status supervisor.service
● supervisor.service - LSB: Start/stop supervisor
Loaded: loaded (/etc/init.d/supervisor)
Active: failed (Result: exit-code) since Wed 2015-09-23 11:12:01 UTC; 16s ago
Process: 21505 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
Process: 21511 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)
Sep 23 11:12:01 host supervisor[21511]: Starting supervisor:
Sep 23 11:12:01 host systemd[1]: supervisor.service: control process exited, code=exited status=1
Sep 23 11:12:01 host systemd[1]: Failed to start LSB: Start/stop supervisor.
Sep 23 11:12:01 host systemd[1]: Unit supervisor.service entered failed state.
But there is nothing in the supervisor or uwsgi logs.
Supervisor 3.0 is running uwsgi with this configuration:
[program:uwsgi]
stopsignal=QUIT
command = uwsgi --ini uwsgi.ini
directory = /dir/
environment=ENVIRONMENT=STAGING
logfile-maxbytes = 300MB
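stopsignal=QUIT was added because uWSGI ignores the default signal (SIGTERM) on stop and then gets killed brutally with SIGKILL, leaving orphaned workers.
Is there a way I can investigate what is happening?
EDIT:
Tried, as mnencia suggested: /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
But it still fails half of the time.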
root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
[ ok ] Stopping supervisor (via systemctl): supervisor.service.
● supervisor.service - LSB: Start/stop supervisor
Loaded: loaded (/etc/init.d/supervisor)
Active: inactive (dead) since Tue 2015-11-24 13:04:32 UTC; 89ms ago
Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
Process: 23349 ExecStart=/etc/init.d/supervisor start (code=exited, status=0/SUCCESS)
Nov 24 13:04:30 xxx supervisor[23349]: Starting supervisor: supervisord.
Nov 24 13:04:30 xxx systemd[1]: Started LSB: Start/stop supervisor.
Nov 24 13:04:32 xxx systemd[1]: Stopping LSB: Start/stop supervisor...
Nov 24 13:04:32 xxx supervisor[23490]: Stopping supervisor: supervisord.
Nov 24 13:04:32 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
[....] Starting supervisor (via systemctl): supervisor.serviceJob for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
failed!
root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
[ ok ] Stopping supervisor (via systemctl): supervisor.service.
● supervisor.service - LSB: Start/stop supervisor
Loaded: loaded (/etc/init.d/supervisor)
Active: failed (Result: exit-code) since Tue 2015-11-24 13:04:32 UTC; 1s ago
Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
Process: 23526 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)
Nov 24 13:04:32 xxx systemd[1]: supervisor.service: control process exited, code=exited status=1
Nov 24 13:04:32 xxx systemd[1]: Failed to start LSB: Start/stop supervisor.
Nov 24 13:04:32 xxx systemd[1]: Unit supervisor.service entered failed state.
Nov 24 13:04:32 xxx supervisor[23526]: Starting supervisor:
Nov 24 13:04:33 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
[ ok ] Starting supervisor (via systemctl): supervisor.service.
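This is not necessarily a Supervisor bug. I see from your systemctl status output that supervisor is started through the sysv-init compatibility layer, so the failure could be in the /etc/init.d/supervisor script. That would explain the absence of errors in the supervisord logs.
To debug the init script, the easiest way is to add set -x as the first non-comment instruction in that file, and then look at the trace of the script's execution in the journalctl output.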
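As a small illustration of what set -x produces, the sketch below runs a stand-in script rather than the real /etc/init.d/supervisor. The temporary file and its contents are hypothetical and only show the shape of the trace.

```shell
#!/bin/sh
# Stand-in for an init script, written to a temporary file so the
# demo does not touch /etc/init.d/supervisor.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
set -x            # first non-comment instruction: trace every command
NAME=supervisord
echo "Starting $NAME"
EOF

# The trace is written to stderr; capture it to see what it looks like.
trace=$(sh "$script" 2>&1 >/dev/null)
echo "$trace"
rm -f "$script"
```

Each traced command is printed with a leading +. On the real system, once set -x is in the init script, the same kind of trace should show up in journalctl -u supervisor.service after the next start or restart attempt.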
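EDIT:
I have reproduced and debugged it on a test system with Debian Sid.
The issue is that the stop target of the supervisor init script does not check whether the daemon has really terminated; it only sends a signal if the process exists. If the daemon takes a while to shut down, the subsequent start action fails because the dying daemon is still counted as already running.
I have opened a bug on the Debian Bug Tracker: http://bugs.debian.org/805920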
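One way to avoid the race described above is to wait until the old process is really gone before starting again. The following is a minimal sketch, not the Debian fix: wait_while and pid_alive are hypothetical helper names, and the pidfile path in the usage note is an assumption about the local setup.

```shell
#!/bin/sh
# Poll a command until it fails (the condition it tests is gone)
# or until the timeout expires.
# Usage: wait_while TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns 0 once COMMAND fails, 1 if it still succeeds after the timeout.
wait_while() {
    timeout=$1
    shift
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# True while the process recorded in the given pidfile is still alive.
# kill -0 sends no signal; it only checks that the process exists.
pid_alive() {
    [ -r "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}
```

Assuming supervisord writes its pid to /var/run/supervisord.pid, a restart could then be written as /etc/init.d/supervisor stop followed by wait_while 30 pid_alive /var/run/supervisord.pid && /etc/init.d/supervisor start. Checking the process itself matters here, because the status target goes through systemctl and can report the service as inactive while the old daemon is still shutting down.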
Workaround:
You can work around the issue with:
/etc/init.d/supervisor force-stop && \
/etc/init.d/supervisor stop && \
/etc/init.d/supervisor start
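force-stop makes sure supervisord has been terminated (outside of systemd).
stop makes sure systemd knows it has been terminated.
start starts it again.
The stop after the force-stop is required; otherwise systemd will ignore any subsequent start request. stop and start can be combined using restart, but I have written them out separately here to show how it works.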
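I had this problem on Ubuntu 14.04. I tried the latest init.d script from Debian and @mnencia's solution, but neither worked for me. The force-stop solution did not kill the program processes; they were simply left running after supervisord was killed.
My solution was to patch the start and restart sections of the supervisord init.d script. I did not want to guess a good DODTIME; I wanted it to start as soon as the old supervisor master process had died, so I added retry logic. Note that it is a bit verbose: if you don't like that behavior you can remove the echo calls, and you can change the maximum number of retries (set to 20 here).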
start)
    echo -n "Starting $DESC: "
    i=1
    until [ $i -ge 21 ]; do
        start-stop-daemon --start --quiet --pidfile $PIDFILE --startas $DAEMON -- $DAEMON_OPTS && break
        echo -n -e "\nAlready running, old process still finishing? retrying ($i/20)..."
        let "i += 1"
        sleep 1
    done
    sleep 1
    if running ; then
        echo "$NAME."
    else
        echo " ERROR."
    fi
    ;;
restart)
    echo -n "Restarting $DESC: "
    start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE
    i=1
    until [ $i -ge 21 ]; do
        start-stop-daemon --start --quiet --pidfile $PIDFILE --startas $DAEMON -- $DAEMON_OPTS && break
        echo -n -e "\nAlready running, old process still finishing? retrying ($i/20)..."
        let "i += 1"
        sleep 1
    done
    echo "$NAME."
    ;;
I also changed the shebang (the first line) to use bash instead of sh, because I wanted to use let:
#! /bin/bash
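If switching the script to bash is not desirable, the same retry idea can be written in plain POSIX sh by using $((...)) arithmetic instead of let. This is a sketch only; retry is a hypothetical helper name, not part of the original init script.

```shell
#!/bin/sh
# Run a command until it succeeds, at most MAX times, sleeping DELAY
# seconds between attempts.
# Usage: retry MAX DELAY COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        echo "attempt $attempt/$max failed, retrying..." >&2
        attempt=$((attempt + 1))   # POSIX arithmetic, no 'let' needed
        sleep "$delay"
    done
}
```

Inside the init script, the loop body could then become something like retry 20 1 start-stop-daemon --start --quiet --pidfile $PIDFILE --startas $DAEMON -- $DAEMON_OPTS, keeping the /bin/sh shebang.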