delayed_job monitored by God - duplicate processes after restart
I am monitoring delayed_job using God. This is my God config file.
QUEUE = "slow"
WORKERS = 14
WORKERS.times do |num|
  God.watch do |w|
    w.name = "dj.#{num}"
    w.group = "tanda"
    w.uid = 'deployer'
    w.gid = 'deployer'
    w.start = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job start --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
    w.restart = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job restart --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
    w.stop = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job stop -i #{num}"
    w.start_grace = 30.seconds
    w.restart_grace = 30.seconds
    w.stop_grace = 30.seconds
    w.pid_file = "#{RAILS_ROOT}/tmp/pids/delayed_job.#{num}.pid"
    w.log = "#{RAILS_ROOT}/log/dj.#{num}.log"
    w.err_log = "#{RAILS_ROOT}/log/dj.#{num}.errors.log"
    w.behavior(:clean_pid_file)
    w.interval = 30.seconds
    w.dir = File.expand_path('.')
    w.env = {
      "RACK_ENV" => RAILS_ENV,
      "RAILS_ENV" => RAILS_ENV,
      "CURRENT_DIRECTORY" => RAILS_ROOT
    }
    w.start_if do |start|
      start.condition(:process_running) do |c|
        c.interval = 5.seconds
        c.running = false
      end
    end
    w.lifecycle do |on|
      on.condition(:flapping) do |c|
        c.to_state = [:start, :restart]
        c.times = 10
        c.within = 3.minutes
        c.transition = :unmonitored
        c.retry_in = 10.minutes
      end
    end
  end
end
I then restart these processes with Capistrano 2 on every deploy:
run("cd #{current_path} && rvmsudo god restart tanda")
When I start God, my ps output looks like this.
ps -e -www -o pid,rss,command | grep delayed
31960 220804 delayed_job.0
31966 220152 delayed_job.8
31973 226012 delayed_job.9
31979 215176 delayed_job.1
31984 210260 delayed_job.13
31994 240424 delayed_job.3
31997 225248 delayed_job.11
32003 196364 delayed_job.5
32009 236192 delayed_job.6
32015 214540 delayed_job.12
32022 247096 delayed_job.4
32029 206352 delayed_job.2
32047 232748 delayed_job.7
32061 228128 delayed_job.10
If I immediately run the Capistrano restart, without deploying or doing anything else, it looks like this a minute later.
ps -e -www -o pid,rss,command | grep delayed
9884 198076 delayed_job.10
9895 195372 delayed_job.0
9919 196856 delayed_job.6
9948 196772 delayed_job.5
9964 196568 delayed_job.9
9973 194092 delayed_job.12
9982 195648 delayed_job.13
9997 196392 delayed_job.2
10005 195356 delayed_job.4
10016 197268 delayed_job.3
10032 198820 delayed_job.8
10054 194316 delayed_job.7
10078 196780 delayed_job.11
10127 202420 delayed_job.1
10133 197468 delayed_job.1
10145 194040 delayed_job.1
10158 195760 delayed_job.1
10173 195844 delayed_job.1
After another restart:
ps -e -www -o pid,rss,command | grep delayed
9884 221780 delayed_job.10
9973 225100 delayed_job.12
9982 224708 delayed_job.13
10078 235076 delayed_job.11
21467 187056 delayed_job.0
21483 187844 delayed_job.7
21497 189648 delayed_job.10
21509 187316 delayed_job.2
21518 188180 delayed_job.11
21527 187968 delayed_job.3
21542 187852 delayed_job.12
21546 186900 delayed_job.13
21556 188628 delayed_job.5
21565 187816 delayed_job.9
21574 185216 delayed_job.4
21585 188088 delayed_job.1
21599 188556 delayed_job.1
21602 188400 delayed_job.1
21615 193484 delayed_job.1
21628 193288 delayed_job.8
21632 188228 delayed_job.1
21643 187804 delayed_job.6
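As you can see, the duplicated processes sometimes have new pids (e.g. from the first dump to the second) but sometimes don't (e.g. DJ 10 from the second to the third).
I really don't know where to start debugging this. God doesn't give any errors on restart, and the DJ logs just show the usual output when starting the processes. The same thing doesn't happen on a smaller server running only 4 workers (but otherwise identical).
Has anyone seen this before?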
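To make the duplicates easier to spot than by eyeballing the dumps, something like the following could be run over the ps output. This is only a rough sketch: `duplicate_workers` is a hypothetical helper name, and it assumes the three-column `pid,rss,command` layout used above.

```ruby
# Hypothetical helper: report worker names that appear more than once in
# `ps -e -www -o pid,rss,command` output (three whitespace-separated columns).
def duplicate_workers(ps_output)
  counts = Hash.new(0)
  ps_output.each_line do |line|
    _pid, _rss, command = line.strip.split(/\s+/, 3)
    # Skip blank lines and anything that is not a delayed_job worker.
    next unless command && command.start_with?("delayed_job")
    counts[command] += 1
  end
  # Keep only the names that are running more than once.
  counts.select { |_name, count| count > 1 }
end

# A few lines taken from the second dump above.
sample = <<~PS
  10078 196780 delayed_job.11
  10127 202420 delayed_job.1
  10133 197468 delayed_job.1
  10145 194040 delayed_job.1
PS

p duplicate_workers(sample)
```

In practice the input would come from the real command, e.g. `duplicate_workers(`ps -e -www -o pid,rss,command`)`.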
I think this must be an issue in the daemons gem that delayed_job uses to run in the background, because adding this to the top of my God file seems to have fixed things:
ids = ('a'..'z').to_a
WORKERS.times do |num|
  num = ids[num]
  # rest of the God.watch block as before
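It seems there is an issue where processes named delayed_job.1 and delayed_job.11 (etc.) clash, and this causes a lot of problems. I haven't really isolated it much further, but changing to a different naming convention (delayed_job.a in this case) has solved the problem for me for now.
I'll leave this open in case anyone has a better solution, or a reason to explain why it works.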
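One possible explanation, which I have not verified against the daemons source, is that something in the stop/restart path matches process names by prefix. Under that assumption the sketch below shows why the numeric names are ambiguous and the letter names are not. `prefix_collisions` is a hypothetical helper written only for this illustration.

```ruby
# Worker names under the numeric scheme (0..13) and the letter scheme (a..n).
numeric = (0...14).map { |i| "delayed_job.#{i}" }
letters = ('a'..'z').first(14).map { |id| "delayed_job.#{id}" }

# Returns [short, long] pairs where `short` is a strict prefix of `long`.
# ASSUMPTION: prefix matching is only a guess at how the names could clash.
def prefix_collisions(names)
  names.product(names).select { |a, b| a != b && b.start_with?(a) }
end

p prefix_collisions(numeric) # pairs involving delayed_job.1 and delayed_job.10..13
p prefix_collisions(letters) # no pairs
```

That would at least be consistent with the dumps, where it is always delayed_job.1 that gets duplicated and delayed_job.10 to delayed_job.13 that keep their old pids. Note that the letter scheme only covers 26 workers.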