Zero uptime deploy

Photo remix courtesy of Rob Boudon. CC BY 2.0

Unicorn has a nice feature that earned it a lot of popularity and set the standard for other Ruby web servers: the ability to do a Zero Downtime Deploy, also known as a rolling deploy, rolling restart, or hot restart. You trigger it by sending the USR2 signal to the master process. But here is something that most websites won’t tell you: it can fail and you won’t even notice. You will be thinking that everything went OK, living in Wonderland, whereas in reality your deploy achieved an uptime of exactly 0 seconds.

So what you need is a small verification procedure to confirm that everything worked as expected. This article demonstrates a simple way to achieve that if you deploy your app with Capistrano, but you can use a very similar procedure with other deployment tools.

Here is what we assume you already have:

deploy.rb

Nothing fancy here. As the documentation states:

USR2 signal for master process - reexecute the running binary. A separate QUIT should be sent to the original process once the child is verified to be up and running.

namespace :deploy do
  task :reload, :roles => :app, :except => { :no_release => true } do
    run "#{try_sudo} kill -s USR2 `cat #{unicorn_pid}`"
  end

  task :restart, :roles => :app, :except => { :no_release => true } do
    reload
  end
end
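
The snippet above assumes that the unicorn_pid variable is set somewhere in your Capistrano configuration. A typical definition might look like the following sketch (the path here is just an assumption, adjust it to your setup):

# Assumed location of the Unicorn master pid file
set :unicorn_pid, "#{shared_path}/pids/unicorn.pid"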

config/unicorn.rb

Whenever we spawn a new child process, we decrement the number of worker processes in the old master by one by sending it the TTOU signal.

At the end we send QUIT so that the new master can take its place.

before_fork do |server, worker|
  old_pid = "#{server.config[:pid]}.oldbin"
  if old_pid != server.pid
    begin
      # The last worker to be spawned sends QUIT to the old master,
      # every earlier one sends TTOU to shrink its worker pool by one
      sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
      Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # Old master is already gone or the pid file is missing
    end
  end

  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end
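
Since before_fork disconnects ActiveRecord, a matching after_fork hook typically re-establishes the connection in every freshly forked worker. A minimal sketch, assuming a standard ActiveRecord setup:

after_fork do |server, worker|
  # Each worker needs its own database connection after forking
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end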

Let’s add the verification step after deployment.

deploy.rb

We want to trigger our verification procedure after every deploy, no matter whether we executed it with or without migrations.

Also, we don’t want to implement the entire verification algorithm in this file, so we extract it into './config/deploy/verify' and require it inside the task.


require 'securerandom'
set :deploy_token, SecureRandom.hex(16)

namespace :deploy do
  namespace :verify do
    task :prepare, :roles => :app, :except => { :no_release => true } do
      # echo -n writes the token without a trailing newline,
      # so the HTTP response can be compared with it verbatim
      run "echo -n #{deploy_token} > #{release_path}/TOKEN"
    end

    task :check, :roles => :app, :except => { :no_release => true } do
      require './config/deploy/verify'

      user = 'about'
      pass = 'VerySecretPass'
      url  = "https://#{user}:#{pass}@#{target_host}/about/deploy"
      DeployVerification.new(url, deploy_token).start
    end
  end
end

before "deploy:restart", "deploy:verify:prepare"
after  "deploy:restart", "deploy:verify:check"
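
In Capistrano 2, deploy:migrations invokes deploy:restart as its final step, so hooking into deploy:restart is enough to cover both cap deploy and cap deploy:migrations.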

config/deploy/production.rb

set :target_host, "app.example.com"

config/deploy/staging.rb

set :target_host, "app.example.org"

config/deploy/verify.rb

The whole idea is that we make a request to the just deployed/restarted web app and check whether it returns the randomly generated token that we set right before the restart. If it does, everything went smoothly: the new workers started, read the new token, and are serving it.

If, however, the new Unicorn workers could not start properly after the deploy, the old workers will still be running and serving requests, including the request to /about/deploy, which will return the old token generated during the previous deploy.

It takes some time to boot the new Rails app, spawn the new workers, kill the old ones, and for the Unicorn master to switch to the new process. So we wait at most 60 seconds for the entire procedure to finish. During that time we keep hitting our application with a request every half second to check whether the new workers are serving requests or the old ones still are.

require 'net/http'
require 'net/https'
require 'timeout'

class DeployVerification
  class VerificationFailedAtDir < StandardError; end

  def initialize(url, token, timeout = 60)
    @timeout = timeout
    @url     = url
    @token   = token
  end

  def start
    # Poll the app until the new workers serve the fresh token,
    # giving up after @timeout seconds
    Timeout.timeout(@timeout) do
      begin
        uri  = URI.parse(@url)
        http = Net::HTTP.new(uri.host, uri.port)
        http.use_ssl = true
        http.start do |connection|
          req = Net::HTTP::Get.new(uri.path)

          if uri.user && uri.password
            req.basic_auth uri.user, uri.password
          end

          result = connection.request(req).body
          unless result == @token
            raise VerificationFailedAtDir,
              "Invalid verification token. Expected: #{@token}, got: #{result}."
          end
          puts "Verified deploy is running"
        end
      rescue VerificationFailedAtDir => x
        puts x.message
        puts "Error when running verification. Retrying... \n"
        sleep(0.5)
        retry
      end
    end
  end
end
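
For reference, triggering the check by hand from an IRB session might look like this (the token below is a made-up value for illustration only):

url   = "https://about:VerySecretPass@app.example.com/about/deploy"
token = "0123456789abcdef0123456789abcdef"
DeployVerification.new(url, token).start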

config/routes.rb


get "about/deploy"

app/controllers/about_controller.rb

Here is the controller doing the basic auth and serving the token. It does not try to dynamically read the TOKEN file, because that would always return the new value written to the file during the last deploy.

Instead, it returns a token that is read only once, during the Rails startup process.

class AboutController < ApplicationController

  before_filter :http_basic_authentication

  def http_basic_authentication
    authenticate_or_request_with_http_basic do |name, pass|
      name == 'about' && pass == 'VerySecretPass'
    end
  end

  def deploy
    render text: Rails.configuration.deploy_token, layout: false
  end
end

config/application.rb

Here you can see that we read and store the token while Rails is booting, so the running workers keep the token from the deploy that started them.

deploy_token_file   = Rails.root.join('TOKEN')
config.deploy_token = if deploy_token_file.exist?
  deploy_token_file.read
else
  'none'
end

But why?

Now that you know how, you are probably still wondering why.

Not everything can be caught by your tests, especially not errors made in the production environment configuration. That can be even something as simple as a typo in config/environments/production.rb.

We have also experienced gems behaving differently and preventing the app from starting due to tiny differences in environment variables (ENV). So now, whenever we manage an application that is not hosted in the cloud because of customer preferences, we just add this little script to make sure that the deployed code was actually deployed and the workers restarted properly. Because sending a signal is sometimes just not good enough :)
