Zero uptime deploy
Unicorn has a nice feature that earned it a lot of popularity and set the standard
for other Ruby web servers: the ability to do a zero downtime deploy, also known
as a rolling deploy, rolling restart, or hot restart. You start it by sending the USR2
signal to the master process. But here is something that most websites won't tell you: it can fail
and you won't even notice. You will be thinking that everything went OK, living in Wonderland,
whereas in reality your deploy achieved an uptime of exactly 0 seconds.
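Triggering the restart by hand boils down to signalling the master process. A one-liner sketch, where the pid file path is just an example:

Process.kill(:USR2, File.read("/var/run/unicorn.pid").to_i) # rolling restart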
So what you need is a small verification procedure that checks everything worked as
expected. This article demonstrates a simple solution, assuming you deploy the app
with Capistrano. However, a very similar procedure will work with other deployment tools.
Here is what we assume you already have.
deploy.rb
Nothing fancy here. As the documentation states, sending USR2 to the master
process reexecutes the running binary; a separate QUIT should be sent to the
original process once the child is verified to be up and running.
namespace :deploy do
  task :reload, :roles => :app, :except => { :no_release => true } do
    run "#{try_sudo} kill -s USR2 `cat #{unicorn_pid}`"
  end

  task :restart, :roles => :app, :except => { :no_release => true } do
    reload
  end
end
config/unicorn.rb
Whenever the new master spawns a child process, we decrement the number of
workers in the old master by one by sending it the TTOU signal. When the last
new worker is up, we send QUIT instead, so the new master can take the old one's place.
before_fork do |server, worker|
  old_pid = "#{server.config[:pid]}.oldbin"
  if old_pid != server.pid
    begin
      # TTOU makes the old master drop one worker; once the last new
      # worker is being forked, QUIT shuts the old master down entirely.
      sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
      Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # Old master is already gone - nothing to signal.
    end
  end

  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end
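Since we disconnect ActiveRecord before forking, each new worker needs its own database connection. The config above stops here, but the typical companion hook would look like this:

after_fork do |server, worker|
  # Re-establish the connection in every freshly forked worker.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end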
Let’s add the verification step after deployment.
deploy.rb
We want to trigger our verification procedure after a deploy, no matter whether it was executed with or without migrations.
Also, we don't want to implement the entire verification algorithm in
this file, so we extract it into ./config/deploy/verify
and require it inside the task.
require 'securerandom'

set :deploy_token, SecureRandom.hex(16)

namespace :deploy do
  namespace :verify do
    task :prepare, :roles => :app, :except => { :no_release => true } do
      # echo -n: no trailing newline, so the exact string comparison works
      run "echo -n #{deploy_token} > #{release_path}/TOKEN"
    end

    task :check, :roles => :app, :except => { :no_release => true } do
      require './config/deploy/verify'
      user = 'about'
      pass = 'VerySecretPass'
      url  = "https://#{user}:#{pass}@#{target_host}/about/deploy"
      DeployVerification.new(url, deploy_token).start
    end
  end
end
before "deploy:restart", "deploy:verify:prepare"
after "deploy:restart", "deploy:verify:check"
config/deploy/production.rb
set :target_host, "app.example.com"
config/deploy/staging.rb
set :target_host, "app.example.org"
config/deploy/verify.rb
The whole idea is that we make a request to the just-deployed/restarted web app and check whether it returns the randomly generated token that we set before the restart. If it does, everything went smoothly: the new workers started, read the new token, and are now serving it.
If, however, the new Unicorn workers could not start properly after the deploy,
the old workers will still be running and serving requests, including the
request to /about/deploy, which will give us the old token generated during
the previous deploy.
It takes some time to start the new Rails app, spawn the new workers, kill the old ones, and for the Unicorn master to switch to the new process. So we wait at most 60 seconds for the entire procedure to finish. During that time we hit the application with a request every now and then to check whether the new workers are serving requests yet, or still the old ones.
require 'net/http'
require 'net/https'
require 'timeout'

class DeployVerification
  class VerificationFailedAtDir < StandardError; end

  def initialize(url, token, timeout = 60)
    @timeout = timeout
    @url     = url
    @token   = token
  end

  def start
    Timeout.timeout(@timeout) do
      begin
        uri  = URI.parse(@url)
        http = Net::HTTP.new(uri.host, uri.port)
        http.use_ssl = true
        http.start do |conn|
          req = Net::HTTP::Get.new(uri.path)
          req.basic_auth(uri.user, uri.password) if uri.user && uri.password
          result = conn.request(req).body
          unless result == @token
            raise VerificationFailedAtDir,
              "Invalid verification token. Expected: #{@token}, got: #{result}."
          end
          puts "Verified deploy is running"
        end
      rescue VerificationFailedAtDir => x
        # Old workers are still answering; give the restart a moment and retry
        # until either the new token shows up or Timeout::Error is raised.
        puts x.message
        puts "Error when running verification. Retrying...\n"
        sleep(0.5)
        retry
      end
    end
  end
end
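You can also exercise the class from an IRB session; the credentials and token here are placeholders:

url = "https://about:VerySecretPass@app.example.com/about/deploy"
DeployVerification.new(url, "0123abcd" * 4).start
# Prints "Verified deploy is running" on success; raises Timeout::Error
# when the new token does not show up within 60 seconds.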
config/routes.rb
get "about/deploy"
app/controllers/about_controller.rb
Here is the controller doing basic auth and serving the token. It does
not read the TOKEN file dynamically, because that would
always return the new value written to the file during the last deploy.
Instead, it returns the token that is read only once, during the Rails startup process.
class AboutController < ApplicationController
  before_filter :http_basic_authentication

  def http_basic_authentication
    authenticate_or_request_with_http_basic do |name, pass|
      name == 'about' && pass == 'VerySecretPass'
    end
  end

  def deploy
    render text: Rails.configuration.deploy_token, layout: false
  end
end
config/application.rb
Here you can see that we store the token while Rails is starting.
deploy_token_file = Rails.root.join('TOKEN')
config.deploy_token = if deploy_token_file.exist?
  deploy_token_file.read
else
  'none'
end
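In case the placement is unclear: the snippet lives inside the application class, roughly like this (the module name is just an example):

module MyApp
  class Application < Rails::Application
    deploy_token_file = Rails.root.join('TOKEN')
    config.deploy_token = deploy_token_file.exist? ? deploy_token_file.read : 'none'
  end
end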
But why?
Now that you know how, you are probably still wondering why.
Not everything can be caught by your tests, especially not errors made in the
production environment configuration. That can be something as simple as a
typo in config/environments/production.rb.
We have also experienced gems behaving differently and preventing the app from
starting due to a tiny difference in environment variables (ENV). So now,
whenever we manage an application that is not hosted in the cloud because of customer
preferences, we just add this little script to make sure that the deployed code
was actually deployed and the workers restarted properly. Because sending a signal
is sometimes just not good enough :)