We have all seen this dreaded screen before.

In the Rails case this usually happens during application restarts.

While Discourse is rapidly evolving we are heavily encouraging users to upgrade frequently, even weekly. If your site is regularly erroring out, users very quickly lose confidence. In the ideal case you want zero downtime deploys, which in turn encourage users to deploy more often.

Unicorn has built-in support for live restarts; however, getting this to play well with a supervisor like runit is not easy. The underlying pids change and things get complicated fast.

To tackle this I decided to create a simple bash script that acts as a mini-supervisor for unicorn.

However, before any of this I needed some sane way of measuring how well I did.

Measuring uptime during a live restart

Traditionally you would use apache bench for quick and dirty testing, however it did not fare well for me. Unfortunately ab has no way of "throttling" the number of requests it sends out. To measure uptime we need to perform a request to the site every N milliseconds.

I ended up knocking up a quick and dirty apache bench clone that allows me to trickle through requests:

require "optparse"
require "uri"
require "net/http"

duration = 10
per_second = 10

opts = OptionParser.new do |opts|
  opts.banner = "Usage: bench_web [options] url"

  opts.on("-t", "--time TIME", OptionParser::DecimalInteger, "Duration to run the test in seconds (default 10)") do |t|
    duration = t
  end

  opts.on("-p", "--per-second REQUESTS", OptionParser::DecimalInteger, "Max number of requests per second (default 10)") do |t|
    per_second = t.to_f
  end

end

opts.parse!

if ARGV.length != 1
 puts opts.banner
 puts
 exit(1)
end


uri = begin
        URI(ARGV[0])
      rescue
        puts opts.banner
        puts
        puts "Invalid URL"
        puts
        exit(1)
      end

GC.disable

finish_time = Time.now + duration
results = []
while (start=Time.now) < finish_time
  res = Net::HTTP.get_response(uri)
  req_duration = Time.now - start
  results << {duration: req_duration, code: res.code, length: res.body.length}

  GC.enable
  GC.start
  GC.disable

  padding = (1 / per_second.to_f) - (Time.now - start)
  if padding > 0
    sleep padding
  end
  putc "."
end

GC.enable

puts
puts "Results"
puts "Total duration: #{duration} second#{duration==1?"":"s"}"
puts "Total requests: #{results.length}"

summary = results.group_by{|r| r[:code]}.map{|code, array| [code, array.count]}.sort{|a,b| a[1] <=> b[1]}

failures = summary.map{|code, count| code == "200" ? 0 : count}.inject(0, :+)

if failures > 0
 puts "Estimated downtime: #{((failures.to_f * (1.to_f / per_second)) * 1000).to_i}ms"
end

puts
puts "By status code: #{summary.map{|code,count| "[#{code}]x#{count} "}.join}"

puts ""

puts "Percentage of the successful requests served within a certain time (ms)"

good_requests = results.find_all{|r| r[:code] == "200"}.map{|r| r[:duration]}.sort

if good_requests.length > 0
  [25,50,66,75,80,90,95,98,99,100].map{ |percentile|
    time = good_requests[((percentile.to_f / 100.0) * (good_requests.length-1)).to_i]
    puts "  #{percentile}%\t\t#{(time * 1000).to_i}"
  }
end

For example, if I run it against a site that is restarting without any fancy help I can see:

$ ruby ./bench_web.rb -t 10 http://l.discourse/
Total duration: 10 seconds
Total requests: 82
Estimated downtime: 3700ms

By status code: [502]x37 [200]x45 

Percentage of the successful requests served within a certain time (ms)
  25%		16
  50%		18
  66%		19
  75%		20
  80%		20
  90%		21
  95%		21
  98%		58
  99%		58
  100%		1854

Not too good, that is 3.7 seconds of downtime while flipping this process.
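The downtime figure rests on a simple assumption: every failed request stands in for one full sampling interval during which the site was down. A minimal sketch of that arithmetic follows; the method name `estimated_downtime_ms` is illustrative and not part of the script.

```ruby
# Each failed request is assumed to represent one sampling interval
# (1 / per_second seconds) of downtime.
def estimated_downtime_ms(failures, per_second)
  (failures * (1.0 / per_second) * 1000).round
end

# 37 failures at 10 requests per second, as in the run above
puts estimated_downtime_ms(37, 10) # prints 3700
```

Treat the number as a rough estimate only. The run above completed 82 requests in 10 seconds rather than 100, so some requests took longer than one interval.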

Supervising unicorns

The standard way to do live restarts with unicorn (assuming you are preloading the app) is to send a USR2 signal to the master process, wait for it to launch a new master and then send a TERM to the old master. However, this plays really badly with supervisors that need pids not to change.
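The USR2 then TERM sequence can be sketched in Ruby roughly as follows. This is only an illustration under assumptions: unicorn is assumed to be configured with a `pid` file, the new master is assumed to write its pid to that same path once it is up, and the method names and the pid file path are hypothetical.

```ruby
# Poll until the pid file holds a pid other than old_pid, or give up.
# Returns the new pid, or nil on timeout.
def wait_for_new_master(pid_file, old_pid, timeout: 30, interval: 0.1)
  deadline = Time.now + timeout
  while Time.now < deadline
    if File.exist?(pid_file)
      pid = File.read(pid_file).to_i
      return pid if pid > 0 && pid != old_pid
    end
    sleep interval
  end
  nil
end

# Assumption: pid_file is the path given to unicorn's `pid` setting.
def live_restart(pid_file)
  old_pid = File.read(pid_file).to_i
  Process.kill("USR2", old_pid)   # ask the old master to launch a new one
  new_pid = wait_for_new_master(pid_file, old_pid)
  raise "new master did not appear" unless new_pid
  Process.kill("TERM", old_pid)   # retire the old master
  new_pid
end
```

A call would look like `live_restart("/path/to/unicorn.pid")`. Note that a changed pid only shows the new master has started, not that its workers are ready to serve requests.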

To work around this I created a simple bash file that acts as a proxy. It has a stable pid and takes care of signalling and restarting the unicorn it is running. Send it a USR2 and it will initiate the process.

#!/bin/bash

# This is a helper script you can use to supervise unicorn, it allows you to perform a live restart
# by sending it a USR2 signal

LOCAL_WEB="http://127.0.0.1:3000/"

function on_exit()
{
  kill $UNICORN_PID
  echo "exiting"
}

function on_reload()
{
  echo "Reloading unicorn"
  kill -s USR2 $UNICORN_PID
  sleep 10
  curl $LOCAL_WEB &> /dev/null
  NEW_UNICORN_PID=`ps -f --ppid $UNICORN_PID | grep unicorn | grep -v worker | awk '{ print $2 }'`
  kill $UNICORN_PID
  echo "Old pid is: $UNICORN_PID New pid is: $NEW_UNICORN_PID"
  UNICORN_PID=$NEW_UNICORN_PID
}

export UNICORN_SUPERVISOR_PID=$$

trap on_exit EXIT
trap on_reload USR2

unicorn -c $1 &
UNICORN_PID=$!

echo "supervisor pid: $UNICORN_SUPERVISOR_PID unicorn pid: $UNICORN_PID"

while [ -e /proc/$UNICORN_PID ]
do
  sleep 0.1
done

Then I can run the following at any point in time to perform a coordinated live restart:

kill -s USR2 <pid>

Additionally the script will stop the unicorn it is supervising if it is killed or exits.

Added bonus, suicide channel

The script passes in the supervisor pid to the unicorn process. At this point the unicorn master can regularly check that its supervisor is still running and terminate itself if for some reason somebody ran kill -9 on the supervisor script.

#unicorn conf

# declared outside the block so the flag persists across before_fork calls
initialized = false

before_fork do |server, worker|

  unless initialized

    initialized = true

    supervisor = ENV['UNICORN_SUPERVISOR_PID'].to_i
    if supervisor > 0
      Thread.new do
        while true
          unless File.exist?("/proc/#{supervisor}")
            puts "Kill self supervisor is gone"
            Process.kill "TERM", Process.pid
          end
          sleep 2
        end
      end
    end

  end
end
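The /proc check above is Linux-specific. On systems without /proc, one possible alternative is to probe the supervisor with signal 0, which checks that the process exists without delivering anything to it. This is a sketch, and the helper name `process_alive?` is hypothetical.

```ruby
# Returns true if a process with this pid exists.
# Signal 0 runs the existence and permission checks only.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # process exists but belongs to another user
end

puts process_alive?(Process.pid) # the current process is always alive
```

In the watchdog thread the condition would then read `unless process_alive?(supervisor)`.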

Results

The results of this method are quite fantastic, zero downtime during live restarts:

Total duration: 40 seconds
Total requests: 396

By status code: [200]x396 

Percentage of the successful requests served within a certain time (ms)
  25%		6
  50%		16
  66%		17
  75%		18
  80%		18
  90%		19
  95%		20
  98%		23
  99%		40
  100%		136

I will be rolling this into my docker image, for added robustness.

Comments

Adam Flanagan 8 months ago

I was working on a node.js app a few days ago to measure this sort of thing:

https://github.com/adamflanagan/requests

I've been looking at how different load balancer algorithms and configurations handle deployments and application restarts.

It's pretty basic but it's been helpful to see the data on a chart.

Sam Saffron 8 months ago

Sounds pretty cool, do you have any screenshots? (You can drag and drop them into here.)

Agree on charts, love visualizing data.

Sam Saffron 8 months ago

David Cramer over on Twitter was wondering why go this route when the accepted general solution is to simply run a load balancer.

You have nginx or haproxy balance two sockets or ports, on upgrade you take one down, upgrade it and then repeat for the other.

I am totally for these kinds of setups but they do add complexity elsewhere.

You now need 2 runit services for your webs. A big question mark is whether or not you start one of them in a "disabled state"; Rails being so hungry memory wise means you need to be extremely frugal. Keep in mind our unicorn master is already forking out a few workers to stay memory efficient. Running multiple master processes permanently throws away some memory that can no longer be shared.

Furthermore, the restart process remains tricky: disable one service, upgrade, enable, wait, repeat.

There is a description of such a setup here; it is a great way to have things set up.

Personally, I just find the bash script here a bit simpler for very simple setups and it performs great. I love that I can simply send a kill -s USR2 <pid> to the bash script for a clean rolling restart.

Adam Flanagan 8 months ago

Yep, here's one I've just run. At 11:04:30 I restarted the IIS app pool and at 11:05:30 I did a full restart of IIS. On both occasions there appear to be 2-3 requests that wait for the restart to finish (~25 seconds) but the LB switches it over to the second server pretty quickly. With the IIS restart there are also a couple of 500s.

Joshua Sierles 69 days ago

Have you used this script successfully with the runsv control commands?

Sam Saffron 69 days ago

confirmed, it is buggy, will fix on Discourse.

Joshua Sierles 69 days ago

Cool - I found the 'reload' command sends the HUP signal, so trapping that works.

Joshua Sierles 61 days ago

I'll take a look. Ideally you'd trap all the signals relevant to Unicorn: http://unicorn.bogomips.org/SIGNALS.html. But perhaps just adding HUP for now is better than nothing.

