Live restarts of a supervised unicorn process
We have all seen this dreaded screen before.
In the Rails case this usually happens during application restarts.
While Discourse is rapidly evolving, we heavily encourage users to upgrade frequently, even weekly. If your site is regularly erroring out, users very quickly lose confidence. In the ideal case you want zero downtime deploys; zero downtime heavily encourages users to deploy more rapidly.
Unicorn has built-in support for live restarts; however, getting this to play well with a supervisor such as runit is not easy. The underlying pids change during a restart and things get complicated fast.
To tackle this I decided to create a simple bash script that acts as a mini-supervisor for unicorn.
However, before any of this I needed some sane way of measuring how well I did.
### Measuring uptime during a live restart
Traditionally you would use Apache Bench (ab) for quick and dirty testing; however, it did not fare well for me. Unfortunately, ab has no way of “throttling” the number of requests it sends out. To measure uptime we need to perform a request to the site every N milliseconds.
I ended up knocking up a quick and dirty apache bench clone that allows me to trickle through requests:
require "optparse"
require "uri"
require "net/http"
duration = 10
per_second = 10
opts = OptionParser.new do |opts|
opts.banner = "Usage: bench_web [options] url"
opts.on("-t", "--time TIME", OptionParser::DecimalInteger, "Duration to run the test in seconds (default 10)") do |t|
duration = t
end
opts.on("-p", "--per-second REQUESTS", OptionParser::DecimalInteger, "Max number of requests per second (default 10)") do |t|
per_second = t.to_f
end
end
opts.parse!
if ARGV.length != 1
puts opts.banner
puts
exit(1)
end
uri = begin
URI(ARGV[0])
rescue
puts opt.banner
puts
puts "Invalid URL"
puts
exit(1)
end
GC.disable
finish_time = Time.now + duration
results = []
while (start=Time.now) < finish_time
res = Net::HTTP.get_response(uri)
req_duration = Time.now - start
results << {duration: req_duration, code: res.code, length: res.body.length}
GC.enable
GC.start
GC.disable
padding = (1 / per_second.to_f) - (Time.now - start)
if padding > 0
sleep padding
end
putc "."
end
GC.enable
puts
puts "Results"
puts "Total duration: #{duration} second#{duration==1?"":"s"}"
puts "Total requests: #{results.length}"
summary = results.group_by{|r| r[:code]}.map{|code, array| [code, array.count]}.sort{|a,b| a[1] <=> b[1]}
failures = summary.map{|code, count| code == "200" ? 0 : count}.inject(:+)
if failures > 0
puts "Estimated downtime: #{((failures.to_f * (1.to_f / per_second)) * 1000).to_i}ms"
end
puts
puts "By status code: #{summary.map{|code,count| "[#{code}]x#{count} "}.join}"
puts ""
puts "Percentage of the successful requests served within a certain time (ms)"
good_requests = results.find_all{|r| r[:code] == "200"}.map{|r| r[:duration]}.sort
if good_requests.length > 0
[25,50,66,75,80,90,95,98,99,100].map{ |percentile|
time = good_requests[((percentile.to_f / 100.0) * (good_requests.length-1)).to_i]
puts " #{percentile}%\t\t#{(time * 1000).to_i}"
}
end
For example, if I run it against a site that is restarting without any fancy help, I can see:
$ ruby ./bench_web.rb -t 10 http://l.discourse/
Total duration: 10 seconds
Total requests: 82
Estimated downtime: 3700ms
By status code: [502]x37 [200]x45
Percentage of the successful requests served within a certain time (ms)
25% 16
50% 18
66% 19
75% 20
80% 20
90% 21
95% 21
98% 58
99% 58
100% 1854
Not too good: 37 failed requests at roughly 10 requests per second works out to about 3.7 seconds of downtime while flipping this process.
### Supervising unicorns
The standard way to do live restarts with unicorn (assuming you are preloading the app) is to send a USR2 signal to the master process, wait for it to launch a new master, and then send a TERM to the old master. However, this plays really badly with supervisors that need pids not to change.
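Done by hand, that sequence looks roughly like the sketch below; the pid file path is an assumption based on a typical unicorn config, so adjust it to your setup:

```bash
# Sketch of the standard manual unicorn live restart.
# Assumes the master writes its pid to tmp/pids/unicorn.pid (adjust as needed).
OLD_PID=$(cat tmp/pids/unicorn.pid)

# Ask the running master to exec a new master with the fresh code.
kill -s USR2 $OLD_PID

# Give the new master time to boot and fork its workers.
sleep 10

# Gracefully shut down the old master and its workers.
kill -s TERM $OLD_PID
```

Note that at the TERM step the master pid disappears, which is exactly what upsets tools like runit.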
To work around this I created a simple bash script that acts as a proxy. It has a stable pid and takes care of signalling and restarting the unicorn it is running. Send it a USR2 signal and it will initiate a live restart.
#!/bin/bash
# This is a helper script you can use to supervise unicorn; it allows you to
# perform a live restart by sending it a USR2 signal

LOCAL_WEB="http://127.0.0.1:3000/"

function on_exit()
{
  kill $UNICORN_PID
  echo "exiting"
}

function on_reload()
{
  echo "Reloading unicorn"
  kill -s USR2 $UNICORN_PID
  sleep 10
  # warm up the new workers before retiring the old master
  curl $LOCAL_WEB &> /dev/null
  NEW_UNICORN_PID=`ps -f --ppid $UNICORN_PID | grep unicorn | grep -v worker | awk '{ print $2 }'`
  kill $UNICORN_PID
  echo "Old pid is: $UNICORN_PID New pid is: $NEW_UNICORN_PID"
  UNICORN_PID=$NEW_UNICORN_PID
}

export UNICORN_SUPERVISOR_PID=$$

trap on_exit EXIT
trap on_reload USR2

unicorn -c $1 &
UNICORN_PID=$!

echo "supervisor pid: $UNICORN_SUPERVISOR_PID unicorn pid: $UNICORN_PID"

while [ -e /proc/$UNICORN_PID ]
do
  sleep 0.1
done
Then I can run the following at any point in time to perform a coordinated live restart:
kill -s USR2 <pid>
Additionally, the script will stop the unicorn it is supervising if it is killed or exits.
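For completeness, here is roughly how the wrapper could be hooked into runit; the service name, paths, user and script name (unicorn_supervisor.sh) are assumptions for illustration, not part of the original setup:

```bash
#!/bin/bash
# Hypothetical /etc/service/unicorn/run for runit.
# runit tracks the wrapper's (stable) pid; the wrapper forwards USR2 to
# whichever unicorn master is currently active.
cd /var/www/discourse || exit 1
exec chpst -u www-data ./unicorn_supervisor.sh config/unicorn.conf.rb
```

With something like this in place, `sv 2 unicorn` should deliver the USR2 that triggers a live restart, and `sv stop unicorn` still tears everything down cleanly via the EXIT trap.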
### Added bonus, suicide channel
The script passes the supervisor pid to the unicorn process via an environment variable. The unicorn master can then regularly check that its supervisor is still running, and terminate itself if for some reason somebody ran kill -9 on the supervisor script.
# unicorn conf
initialized = false

before_fork do |server, worker|
  unless initialized
    initialized = true

    supervisor = ENV['UNICORN_SUPERVISOR_PID'].to_i
    if supervisor > 0
      # watchdog thread: if the supervisor vanishes, terminate the master
      Thread.new do
        while true
          unless File.exists?("/proc/#{supervisor}")
            puts "Kill self, supervisor is gone"
            Process.kill "TERM", Process.pid
          end
          sleep 2
        end
      end
    end
  end
end
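A quick way to sanity-check the suicide channel is to kill the supervisor the hard way and watch the master follow it down; the pid variables below are placeholders you would read from the wrapper's startup output:

```bash
# Simulate the worst case: kill -9 the supervisor so its EXIT trap never fires.
kill -9 $SUPERVISOR_PID

# Within a couple of seconds the watchdog thread in the unicorn master should
# notice /proc/<supervisor pid> is gone and send itself a TERM.
sleep 3
ps -p $UNICORN_MASTER_PID || echo "unicorn master is gone, as expected"
```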
### Results
The results of this method are quite fantastic, with zero downtime during live restarts:
Total duration: 40 seconds
Total requests: 396
By status code: [200]x396
Percentage of the successful requests served within a certain time (ms)
25% 6
50% 16
66% 17
75% 18
80% 18
90% 19
95% 20
98% 23
99% 40
100% 136
I will be rolling this into my Docker image for added robustness.
I was working on a node.js app a few days ago to measure this sort of thing:
I’ve been looking at how different load balancer algorithms and configurations handle deployments and application restarts.
It’s pretty basic but it’s been helpful to see the data on a chart.