Sitemaps with a bit of Metal

Photo remix available thanks to the courtesy of tinkerbrad. CC BY 2.0

Sooner or later you will probably start caring about your application’s SEO, especially if it provides a lot of content and you want users to discover it in search engine results. There are several ways to do this in a Ruby app. You can create the sitemap.xml file manually if there aren’t many URLs, but that becomes ineffective once you have more than a dozen or so. There are some very neat tools that will do this for you out of the box, even if you need a lot of customization.

Tools to the rescue

The tool I would like to mention is Sitemap Generator by kjvarga. It’s pretty cool and it follows the standards, so you don’t have to care too much. It also has custom rake tasks that generate a Sitemap according to given criteria and ping selected search engines about the availability of a new one for your site. Magic.

Installation is very easy. You only need to add one line to your Gemfile:

gem 'sitemap_generator'

Then run bundle and rake sitemap:install. You should now have config/sitemap.rb in your directory structure, which you need to tweak for your needs.

SitemapGenerator::Sitemap.default_host = 'http://example.com'
SitemapGenerator::Sitemap.create do
  add '/home', :changefreq => 'daily', :priority => 0.9
  add '/contact_us', :changefreq => 'weekly'
end

And that’s it! All you need to do is run rake sitemap:refresh. Now you have a new sitemap.xml.gz file in your /public directory.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/home</loc>
      <changefreq>daily</changefreq>
      <priority>0.9</priority>
   </url>
   <url>
      <loc>http://www.example.com/contact_us</loc>
      <changefreq>weekly</changefreq>
      <priority>0.5</priority>
   </url>
</urlset>

If your app handles multiple domains, that’s no problem: you can render multiple sitemap files for different domains, subdomains or specific locales.

Hitting Sitemap limits

You might also have heard that a single Sitemap must have no more than 50,000 URLs and can’t be larger than 10MB (the size limit at the time of writing; the protocol has since raised it to 50MB uncompressed). And it’s true. There was a risk that our app would hit that limit in the near future. Fortunately, the Sitemap protocol handles this situation through index files. As I mentioned earlier, sitemap_generator follows the standards pretty well, so by default it creates an index file whenever one is needed. You can also force it to always create an index file.

Let’s use a real-life example. The application in question presents a huge number of events and lets users buy tickets for them. We fetch all events from the database with the find_each method to get the objects in batches. We do this because a large number of objects might not fit into memory. For each event we use the event_path helper to add the proper URL to our sitemap.

SitemapGenerator::Sitemap.create_index = true

SitemapGenerator::Sitemap.default_host = 'http://example.com'
SitemapGenerator::Sitemap.create do
  Event.find_each do |event|
    add event_path(event.slug, locale: false), lastmod: event.updated_at, changefreq: 'daily'
  end
end

After running rake sitemap:refresh we now have at least two files: sitemap.xml.gz and sitemap1.xml.gz. “At least” because another sitemap{n}.xml.gz file gets created for every further 50,000 URLs.
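
To make the file naming concrete, here is a minimal sketch of the splitting rule in plain Ruby. It is not how sitemap_generator works internally: MAX_URLS_PER_FILE and sitemap_filenames are hypothetical names used only for illustration, the sketch ignores the file-size limit, and it assumes that even an empty site gets one numbered file.

```ruby
# Hypothetical illustration of the 50,000-URL rule; not the gem's internals.
MAX_URLS_PER_FILE = 50_000

# Returns the file names we would expect for a given number of URLs:
# the index file plus one numbered sitemap per started batch of URLs.
def sitemap_filenames(url_count, per_file: MAX_URLS_PER_FILE)
  # Integer ceiling division; assume at least one numbered file exists.
  file_count = [(url_count + per_file - 1) / per_file, 1].max
  ['sitemap.xml.gz'] + (1..file_count).map { |n| "sitemap#{n}.xml.gz" }
end

puts sitemap_filenames(100_001).inspect
```

With 100,001 URLs the last URL no longer fits into the second file, so the sketch lists a third one, sitemap3.xml.gz.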

Let’s take a closer look at the content of sitemap.xml.gz:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
 </sitemap>
 <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
 </sitemap>
</sitemapindex>

It no longer contains a Sitemap itself, but an index that specifies where the Sitemaps are.

Content of sitemap1.xml.gz:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/events/awesome_event_1</loc>
      <changefreq>daily</changefreq>
      <priority>0.5</priority>
   </url>
   <!-- ... -->
   <url>
      <loc>http://www.example.com/events/awesome_event_50000</loc>
      <changefreq>daily</changefreq>
      <priority>0.5</priority>
   </url>
</urlset>

Content of sitemap2.xml.gz:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/events/awesome_event_50001</loc>
      <changefreq>daily</changefreq>
      <priority>0.5</priority>
   </url>
   <!-- ... -->
   <url>
      <loc>http://www.example.com/events/awesome_event_100000</loc>
      <changefreq>daily</changefreq>
      <priority>0.5</priority>
   </url>
</urlset>

Pretty easy, isn’t it?

Great, but I don’t want to keep this in my /public directory

If you use a CDN for static files and don’t want to keep the Sitemap in your /public directory, you can pick a specific adapter and set it in the config file:

SitemapGenerator::Sitemap.adapter = SitemapGenerator::WaveAdapter.new

or even

SitemapGenerator::Sitemap.adapter = SitemapGenerator::S3Adapter.new

I really appreciate how the author of the gem handled different storage mechanisms. If you need more customization in this area, you can write your own compatible adapter and save the Sitemap wherever you want: a database, a key-value store or whatever. If we want to use ActiveRecord for this purpose, we can just write:

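require 'stringio'
require 'zlib'

module SitemapGenerator
  # Stores every generated sitemap file as a row in the sitemaps table.
  class ActiveRecordAdapter
    # location carries the target filename, raw_data is the uncompressed XML.
    def write(location, raw_data)
      Sitemap.new do |sitemap|
        sitemap.data      = gzip(StringIO.new, raw_data).string
        sitemap.filename  = location[:filename]
        sitemap.mime_type = 'multipart/x-gzip'
      end.save!
    end

    # Gzips string into the given IO; GzipWriter#close returns that IO.
    def gzip(io, string)
      gz = Zlib::GzipWriter.new(io)
      gz.write string
      gz.close
    end
  end
end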
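
If you want to convince yourself that the bytes stored by such an adapter can be read back, a small round trip in plain Ruby is enough. This is only a sketch: gzip_to_string and gunzip are hypothetical helper names, and the XML string is a made-up sample rather than real generator output.

```ruby
require 'stringio'
require 'zlib'

# Compresses a string in memory, mirroring the adapter's gzip helper.
def gzip_to_string(raw_data)
  io = StringIO.new(''.b) # binary in-memory buffer
  gz = Zlib::GzipWriter.new(io)
  gz.write(raw_data)
  gz.close # writes the gzip footer and closes the buffer
  io.string
end

# Decompresses what gzip_to_string produced.
def gunzip(compressed)
  Zlib::GzipReader.new(StringIO.new(compressed)).read
end

sample_xml = '<urlset><url><loc>http://example.com/home</loc></url></urlset>'
compressed = gzip_to_string(sample_xml)

puts gunzip(compressed) == sample_xml
```

The compressed string is what would end up in the binary data column, and a crawler that downloads it can decompress it the same way.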

One more thing we need to keep this running is a db migration like this:

class CreateSitemaps < ActiveRecord::Migration
  def change
    create_table :sitemaps do |t|
      t.binary :data,      null: false
      t.string :filename,  null: false
      t.string :mime_type, null: false

      t.timestamps
    end

    add_index :sitemaps, [:filename, :created_at]
  end
end

We must also update our config/sitemap.rb file to say that we want to use the custom adapter:

SitemapGenerator::Sitemap.adapter = SitemapGenerator::ActiveRecordAdapter.new

OK, Sitemap creation is now up and running. But how do we render it if it’s no longer available in the /public directory? We need to find a way to get the file from the db and render it to the user, in this case a search engine crawler.

Let’s write a custom controller

To render our file we need to create a proper controller. In a typical Rails application we would probably do something like this:

class SitemapsController < ApplicationController
  skip_before_filter :authenticate_user! # because you use devise, don't you?

  def show
    sitemap = Sitemap.where(filename: params[:id]).order('created_at desc').first!
    respond_to do |format|
      format.xml_gz { send_data sitemap.data, filename: sitemap.filename }
    end
  end
end

Our controller responds to the xml_gz format, which Rails doesn’t support by default. We need to register this format so that our controller can render a proper response when a client requests the *.xml.gz format. We can do this by putting the line below in the config/initializers/mime_types.rb file.

Mime::Type.register "application/x-gzip", :xml_gz, [], ["xml.gz"]

One more necessary step is adding these few lines to config/routes.rb:

constraints(format: /[a-z]+(\.[a-z]+)?/) do
  resources :sitemaps, only: :show
  get '/sitemap.:format' => 'sitemaps#show'
end

We use a constraint on format because we need to handle the non-standard, double extension xml.gz. Without it, our Rails app would look for a resource with the .gz extension and sitemap.xml would be treated as the filename.
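
To see which format values this constraint lets through, you can try the regular expression on its own. This is only a sketch in plain Ruby: Rails anchors route constraints itself, so the sketch adds \A and \z to imitate that, and format_allowed? is a hypothetical helper name.

```ruby
# The pattern from routes.rb, anchored by hand to imitate Rails.
FORMAT_CONSTRAINT = /[a-z]+(\.[a-z]+)?/
ANCHORED_FORMAT = /\A#{FORMAT_CONSTRAINT.source}\z/

# True when the whole format string is one or two lowercase extensions.
def format_allowed?(format)
  ANCHORED_FORMAT.match?(format)
end

%w[xml xml.gz gz xml.gz.bak XML].each do |format|
  puts "#{format}: #{format_allowed?(format)}"
end
```

Single extensions such as xml and the double extension xml.gz pass, while a triple extension or an uppercase one does not.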

Let’s take a look at what exactly our controller has inside:

irb(main):001:0> SitemapsController.ancestors
=> [
     SitemapsController,
     ApplicationController,
     #<Module:0x007fc2178c35a8>,
     #<Module:0x007fc2179a01b0>,
     ActionController::Base,
     Turbolinks::XHRHeaders,
     Turbolinks::Cookies,
     Turbolinks::XDomainBlocker,
     Turbolinks::Redirection,
     Devise::Controllers::UrlHelpers,
     Devise::Controllers::Helpers,
     Devise::Controllers::StoreLocation,
     Devise::Controllers::SignInOut,
     ActiveRecord::Railties::ControllerRuntime,
     ActionDispatch::Routing::RouteSet::MountedHelpers,
     ActionController::ParamsWrapper,
     ActionController::Instrumentation,
     ActionController::Rescue,
     ActionController::HttpAuthentication::Token::ControllerMethods,
     ActionController::HttpAuthentication::Digest::ControllerMethods,
     ActionController::HttpAuthentication::Basic::ControllerMethods,
     ActionController::RecordIdentifier,
     ActionController::DataStreaming,
     ActionController::Streaming,
     ActionController::ForceSSL,
     ActionController::RequestForgeryProtection,
     ActionController::Flash,
     ActionController::Cookies,
     ActionController::StrongParameters,
     ActiveSupport::Rescuable,
     ActionController::ImplicitRender,
     ActionController::MimeResponds,
     ActionController::Caching,
     ActionController::Caching::Fragments,
     ActionController::Caching::ConfigMethods,
     AbstractController::Callbacks,
     ActiveSupport::Callbacks,
     ActionController::ConditionalGet,
     ActionController::Head,
     ActionController::Renderers::All,
     ActionController::Renderers,
     ActionController::Rendering,
     ActionController::Redirecting,
     ActionController::RackDelegation,
     ActiveSupport::Benchmarkable,
     AbstractController::Logger,
     ActionController::UrlFor,
     AbstractController::UrlFor,
     ActionDispatch::Routing::UrlFor,
     ActionDispatch::Routing::PolymorphicRoutes,
     ActionController::ModelNaming,
     ActionController::HideActions,
     ActionController::Helpers,
     AbstractController::Helpers,
     AbstractController::AssetPaths,
     AbstractController::Translation,
     AbstractController::Layouts,
     AbstractController::Rendering,
     AbstractController::ViewPaths,
     ActionController::Metal,
     AbstractController::Base,
     ActiveSupport::Configurable,
     Object,
     PP::ObjectMixin,
     ActiveSupport::Dependencies::Loadable,
     JSON::Ext::Generator::GeneratorMethods::Object,
     Kernel,
     BasicObject
   ]

irb(main):002:0> SitemapsController.ancestors.count
=> 68

But do we really need to carry all this stuff, which is usually inherited from ApplicationController? How about no. We don’t really need to skip before filters, and we don’t need URL helpers, Turbolinks, Devise or anything else that is useless in this case. So, let’s slim this down a bit.
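
The reason the list is so long is simply that every included module lands in the ancestor chain. Here is a minimal sketch in plain Ruby; all class and module names in it are made up and merely stand in for the framework ones.

```ruby
# Hypothetical stand-ins for framework modules; the names are made up.
module Rendering; end
module Cookies; end
module Flash; end

class BareController; end

# Pulls in everything, whether the action needs it or not.
class FullController < BareController
  include Rendering
  include Cookies
  include Flash
end

# Includes only what the action actually uses.
class SlimController < BareController
  include Rendering
end

puts (FullController.ancestors - BareController.ancestors).inspect
puts (SlimController.ancestors - BareController.ancestors).inspect
```

Each include adds one more entry that Ruby walks through during method lookup, with the most recently included module listed first.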

Here comes the Metal

class SitemapsController < ActionController::Metal
  include AbstractController::Rendering
  include ActionController::MimeResponds
  include ActionController::DataStreaming
  include ActionController::RackDelegation
  include ActionController::Rescue
  include ActionController::Head

  def show
    sitemap = Sitemap.where(filename: params[:id]).order('created_at desc').first!
    respond_to do |format|
      format.xml_gz { send_data sitemap.data, filename: sitemap.filename }
    end
  end
end

Let’s take a look at what we have achieved:

irb(main):004:0> SitemapsController.ancestors
=> [
     SitemapsController,
     ActionController::Head,
     ActionController::Rescue,
     ActiveSupport::Rescuable,
     ActionController::RackDelegation,
     ActionController::DataStreaming,
     ActionController::Rendering,
     ActionController::MimeResponds,
     AbstractController::Rendering,
     AbstractController::ViewPaths,
     ActionController::Metal,
     AbstractController::Base,
     ActiveSupport::Configurable,
     Object,
     PP::ObjectMixin,
     ActiveSupport::Dependencies::Loadable,
     JSON::Ext::Generator::GeneratorMethods::Object,
     Kernel,
     BasicObject
   ]

irb(main):005:0> SitemapsController.ancestors.count
=> 19

Our controller is much lighter and contains mostly what is necessary to serve our sitemap.xml.gz file to Google, Bing, Yandex or whoever wants our Sitemap. José Valim’s Crafting Rails Applications book inspired me to use ActionController::Metal. Picking only the modules that are indispensable for our controllers is a pretty cool approach but, in my humble opinion, not often seen in Rails applications. Mounting a Sinatra application with the requested functionality in routes.rb or config.ru could be an alternative, still lightweight solution.
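
To give a feeling for how small such a lightweight alternative can get, here is a sketch of a bare Rack-style endpoint instead of a Sinatra app. It is an illustration under assumptions, not a drop-in replacement: SITEMAP_STORE is a hypothetical in-memory stand-in for the Sitemap model, SitemapApp is a made-up name, and error handling is reduced to a plain 404.

```ruby
# Hypothetical in-memory stand-in for the Sitemap model from this post.
SITEMAP_STORE = {
  'sitemap.xml.gz'  => 'gzipped index bytes',
  'sitemap1.xml.gz' => 'gzipped sitemap bytes'
}.freeze

# A Rack application is any object that responds to call(env) and returns
# [status, headers, body], where body responds to each.
SitemapApp = lambda do |env|
  filename = File.basename(env.fetch('PATH_INFO', ''))
  data = SITEMAP_STORE[filename]

  if data
    headers = {
      'content-type'   => 'application/x-gzip',
      'content-length' => data.bytesize.to_s
    }
    [200, headers, [data]]
  else
    [404, { 'content-type' => 'text/plain' }, ['Not Found']]
  end
end

status, = SitemapApp.call('PATH_INFO' => '/sitemap1.xml.gz')
puts status
```

There is no framework in the way here at all, which is also the trade-off: content negotiation, rescuing of errors and the database lookup would all have to be written by hand.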

Making sales and marketing happy

As you can see, a chore like rendering a Sitemap can be done in a smart way, in most cases with just a few lines of code. It’s very useful, especially for applications with a lot of content. Easy customization is another advantage of the presented solution. And now go and make your sales and marketing teams happy by providing better search engine results.

I will try to write another blog post focused on using ActionController::Metal in a different, maybe more surprising use case, provided this topic is interesting for you and Robert doesn’t beat me to it. :)
