Creating a sitemap with Ruby on Rails and uploading it to Amazon S3

23 Jul 15

Sitemaps are a must-have tool for getting your site properly indexed by search engines and improving its ranking. In this post we cover how to create sitemaps for your Rails apps and host them on S3 if needed.

What is a sitemap?

A sitemap is an XML file that lists the URLs of all the pages on your site that you consider worth indexing. It can also include information such as when a given URL was last modified or how often it is likely to change. Normally it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
  http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  ...
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
  <url>
    <loc>http://www.cookieshq.co.uk</loc>
    <lastmod>2015-06-12T12:08:23+02:00</lastmod>
    <changefreq>always</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- More URL definitions -->
</urlset>

At the top we have several sitemap schema definitions (shortened here), followed by all the URLs to be mapped and indexed.

Generating this by hand would be cumbersome, and keeping the last modification dates current or tweaking the file every time something is added to the site would be unsustainable. We need to automate this process.

Enter Karl Varga’s Sitemap Generator gem.

Using the gem

The gem documentation is pretty straightforward about features, installation and configuration. Once you have the gem installed, run rake sitemap:install to generate a default config/sitemap.rb file you can edit. Below is a file we have already customised:

# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://www.cookieshq.co.uk"
# Pick a safe place to write the files
SitemapGenerator::Sitemap.public_path = 'tmp/sitemaps/'

SitemapGenerator::Sitemap.create do
  add clients_path
  add team_path
  add about_path
  add testimonials_path
  add contact_path
  add posts_path, changefreq: 'weekly'
  add login_path, priority: 0.0

  Post.find_each do |post|
    add post_path(post.slug), lastmod: post.updated_at
  end

  CaseStudy.find_each do |case_study|
    add case_study_path(case_study.slug), lastmod: case_study.updated_at, changefreq: 'never'
  end
end

The default host

As you can see, the first thing we do is set the host URL for our site, which will be the root of all the URLs contained in the resulting XML file.

The path where the sitemap is stored

After that, we set the path where the compressed XML file will be generated. By default it is the public folder, but we can set it to any other folder in the project, as long as we have write permission on it (more on this later). If you use the public folder, remember to add the name of the generated file to your .gitignore.
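If the folder might not exist yet, or you are not sure the app can write to it, a small guard can help. Here is a minimal sketch in plain Ruby; ensure_writable_dir is a hypothetical helper of ours, not something the gem provides:

```ruby
require 'fileutils'

# Hypothetical helper (not part of the gem): create the output folder if it
# is missing and fail early when the process cannot write to it.
def ensure_writable_dir(path)
  FileUtils.mkdir_p(path) # also creates any missing parent folders
  raise "Cannot write to #{path}" unless File.writable?(path)
  path
end

# Possible use in config/sitemap.rb (assumption):
#   SitemapGenerator::Sitemap.public_path = ensure_writable_dir('tmp/sitemaps/')
```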

Adding links (finally!)

Then there’s a series of static pages we want indexed: the FAQ, the login page, the Terms & Conditions, etc. These could also have been added with a literal path, as in add '/faqs'. Note that you don’t need to add the root_path, as the gem does it automatically for you.

The post index path has its changefreq set to weekly, as we want to tell crawlers and indexers how often that index is likely to change. If we were to publish a new post every day, we could set it to daily.

On our login_path we’ve used the priority parameter and set it to zero: we still want the page indexed, but we want indexers and crawlers to treat it as the least important page on the site, so that more important content appears first in search results. If we didn’t want it indexed at all, we’d just remove it from the generation code.
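Both options only accept a limited set of values: the sitemaps.org protocol defines seven changefreq keywords and a priority between 0.0 and 1.0. As a rough sketch, a check like the following could catch typos before they reach the XML; sitemap_option_errors is a hypothetical helper written for this post, not part of the gem:

```ruby
# Values the sitemaps.org protocol accepts for <changefreq>.
CHANGEFREQ_VALUES = %w[always hourly daily weekly monthly yearly never].freeze

# Hypothetical helper: returns a list of problems found in the given options.
def sitemap_option_errors(changefreq: nil, priority: nil)
  errors = []
  if changefreq && !CHANGEFREQ_VALUES.include?(changefreq.to_s)
    errors << "unknown changefreq: #{changefreq}"
  end
  if priority && !(0.0..1.0).cover?(priority)
    errors << "priority out of range: #{priority}"
  end
  errors
end

sitemap_option_errors(changefreq: 'weekly') # => []
sitemap_option_errors(priority: 0.0)        # => []
sitemap_option_errors(changefreq: 'often', priority: 1.5)
# => ["unknown changefreq: often", "priority out of range: 1.5"]
```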

The last two additions are more interesting, as they relate to indexing dynamic content. On both our Post and CaseStudy models we have set up a string field named slug that is used in the URL, so instead of having http://www.cookieshq.co.uk/posts/1234 we have http://www.cookieshq.co.uk/posts/sitemap-generation-hosting. To get the posts and case studies indexed correctly, we need to add the URL for each record using its slug. We also add the lastmod parameter, which tells indexers they can skip a URL they have already indexed if it has not changed since.

Additionally, we’ve set the changefreq to never on Case Studies, as once a case study is published, it’s unlikely to be changed.

Those would generate XML information like this:

<!--- ... -->
<url>
  <loc>http://www.cookieshq.co.uk/case_studies/gap-medics</loc>
  <lastmod>2015-06-01T15:59:52+00:00</lastmod>
  <changefreq>never</changefreq>
  <priority>0.5</priority>
</url>
<!--- ... -->
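To make the mapping between the add options and the XML explicit, here is a plain Ruby sketch that builds one such entry by hand. It is illustrative only: the gem generates this markup itself, url_entry is a hypothetical helper, and the default values are simply the ones chosen for this sketch.

```ruby
# Illustrative only: `url_entry` is a hypothetical helper, not the gem's code.
def url_entry(host, path, lastmod:, changefreq: 'weekly', priority: 0.5)
  loc   = "#{host}#{path}".encode(xml: :text)        # escape &, < and >
  stamp = lastmod.strftime('%Y-%m-%dT%H:%M:%S%:z')   # W3C datetime format
  <<~XML
    <url>
      <loc>#{loc}</loc>
      <lastmod>#{stamp}</lastmod>
      <changefreq>#{changefreq}</changefreq>
      <priority>#{priority}</priority>
    </url>
  XML
end

puts url_entry('http://www.cookieshq.co.uk', '/case_studies/gap-medics',
               lastmod: Time.utc(2015, 6, 1, 15, 59, 52), changefreq: 'never')
```

Called with the values above, it prints the same entry as the case study example.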

Generating the sitemap

The gem offers a series of tasks to create your sitemap:

rake sitemap:create and rake sitemap:refresh:no_ping do the same: run the sitemap.rb and generate the compressed XML file under the folder specified in the public_path attribute.
rake sitemap:refresh: does the same as the previous ones, but also pings the Google and Bing search engines so they know to fetch your newly created sitemap and update their indexed information about the site. You can ping other search engines as well, as stated in the docs.

Finally, you should set up a cron job on your server to call rake sitemap:refresh as often as needed.
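If you prefer to manage cron from Ruby, one option is the whenever gem. That is an assumption on our part, and nothing in the setup above requires it. A minimal sketch of a config/schedule.rb that refreshes the sitemap once a day (the time is arbitrary):

```ruby
# config/schedule.rb
# Assumes the `whenever` gem is in your Gemfile; it is not part of the
# setup described in this post.
every 1.day, at: '5:00 am' do
  rake 'sitemap:refresh'
end
```

Running whenever --update-crontab on the server then writes the matching crontab entry.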

Serving the sitemap

Normally, with the default configuration on a VPS, search engines should have no difficulty fetching your sitemap from your public folder, as the file would be reachable at, following our example, http://www.cookieshq.co.uk/sitemap.xml.gz.

However, if our application is hosted on Heroku, we face two problems due to its ephemeral filesystem:

We can’t write to the public folder. That’s why we used the tmp folder in our sitemap configuration file above.
We can’t guarantee how long what we save in the tmp folder will stay there.

To get around this, we need to host the generated sitemap somewhere else and let the search engines access it there. The Sitemap Generator gem offers ways to save the generated file on S3 using Fog or CarrierWave, so if you already use either of those in your application, you can have a look at this wiki page. However, installing Fog or CarrierWave just for this can be overkill, so here’s a way to do it that depends only on the aws-sdk gem.

Once we have the aws-sdk gem installed (the code below uses its version 1 API, under the AWS namespace), we also need an Amazon S3 bucket and the proper credentials set in the corresponding Heroku configuration panel, and/or in your local environment for tests:

An S3 Access Key Id: ENV['S3_ACCESS_KEY_ID']
An S3 Secret Access Key: ENV['S3_SECRET_ACCESS_KEY']
The name of the bucket to use: ENV['S3_BUCKET']
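Since a missing variable would only show up as an S3 error halfway through the upload, it can be worth checking for all three up front. A minimal sketch, where missing_s3_settings is a hypothetical helper and not part of any gem:

```ruby
REQUIRED_S3_SETTINGS = %w[S3_ACCESS_KEY_ID S3_SECRET_ACCESS_KEY S3_BUCKET].freeze

# Hypothetical helper: lists the settings that are absent or blank.
# Accepts any hash-like object so it can be exercised without touching ENV.
def missing_s3_settings(env = ENV)
  REQUIRED_S3_SETTINGS.select { |key| env[key].to_s.strip.empty? }
end

missing = missing_s3_settings
warn "Missing S3 settings: #{missing.join(', ')}" unless missing.empty?
```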

Once this is set, we will need a rake task like the following:

# sitemap.rake
require 'aws'

namespace :sitemap do
  desc 'Upload the sitemap files to S3'
  task upload_to_s3: :environment do
    puts "Starting sitemap upload to S3..."

    s3 = AWS::S3.new(access_key_id: ENV['S3_ACCESS_KEY_ID'],
                     secret_access_key: ENV['S3_SECRET_ACCESS_KEY'])

    bucket = s3.buckets[ENV['S3_BUCKET']]

    Dir.entries(File.join(Rails.root, "tmp", "sitemaps")).each do |file_name|
      next if ['.', '..', '.DS_Store'].include? file_name
      path = "sitemaps/#{file_name}"
      file = File.join(Rails.root, "tmp", "sitemaps", file_name)

      # Upload the file; any S3 error is left to propagate and abort the task
      bucket.objects[path].write(file: file)
      puts "Saved #{file_name} to S3"
    end
  end
end

First we set up our AWS client with the credentials, and after that we iterate over the files present in the public_path we configured for the Sitemap Generator, in this case tmp/sitemaps. We have to skip the folder itself and its parent (. and ..) and, if you are testing on OS X, the usual .DS_Store files.

Afterwards, we write each file to our remote bucket, under a sitemaps folder, which should be configured as writable in your AWS panel.

Finally, we need a rake task that we can schedule with cron and that takes care of everything: create the sitemap, upload it to S3 and ping the search engines:
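The file selection can also be pulled out of the task so it is easier to test on its own. The following is a sketch under our own assumptions, with sitemap_uploads as a hypothetical helper; it relies on Dir.glob, whose * pattern already skips dotfiles:

```ruby
# Hypothetical helper: maps each S3 key to the local file it comes from.
def sitemap_uploads(local_dir, prefix: 'sitemaps')
  Dir.glob(File.join(local_dir, '*'))          # '*' skips ., .. and .DS_Store
     .select { |path| File.file?(path) }       # ignore sub-folders
     .sort
     .map { |path| ["#{prefix}/#{File.basename(path)}", path] }
     .to_h
end

# Possible use inside the rake task (assumption):
#   sitemap_uploads(File.join(Rails.root, 'tmp', 'sitemaps')).each do |key, path|
#     bucket.objects[key].write(file: path)
#   end
```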

# sitemap.rake
namespace :sitemap do
  # ...
  desc 'Create the sitemap, then upload it to S3 and ping the search engines'
  task create_upload_and_ping: :environment do
    Rake::Task["sitemap:create"].invoke

    Rake::Task["sitemap:upload_to_s3"].invoke

    SitemapGenerator::Sitemap.ping_search_engines('http://www.cookieshq.co.uk/sitemap.xml.gz')
  end
end

Note that in the last call we’re sending the search engines the URL where they can find our sitemap. But the file is not on our server, so we need to make a small amendment to our routes.rb:

# routes.rb file
get '/sitemap.xml.gz', to: redirect("https://#{ENV['S3_BUCKET']}.s3.amazonaws.com/sitemaps/sitemap.xml.gz"), as: :sitemap

Going a bit further: testing your sitemap generation script

Recently I found this post by Mike Coutermarsh, in which he devises a simple RSpec test that checks whether your sitemap.rb runs properly. Note that it works out of the box only with the latest version of SitemapGenerator at the time of writing (5.1.0). As Coutermarsh states, it allows you to “check if it runs and you don’t forget to update your sitemap.rb if you change your routes”. Here’s an example based on his base specs:

# spec/lib/sitemap_generator/interpreter_spec.rb
require 'spec_helper'

describe SitemapGenerator::Interpreter do
  describe '.run' do
    it 'does not raise an error' do
      allow(SitemapGenerator::Sitemap).to receive(:ping_search_engines).and_return true
      allow(SitemapGenerator::Sitemap).to receive(:create).and_yield

      FactoryGirl.create_list(:faq, 5)
      FactoryGirl.create_list(:team_member, 3)
      FactoryGirl.create(:post, slug: 'test-slug')
      FactoryGirl.create(:case_study, slug: 'successful-app-story')

      expect { described_class.run }.not_to raise_error
    end
  end
end

Conclusion

I hope this post is helpful to you. On a final note, I’d like to credit as a source this post from status203.me by w1zeman1p, which I found via the gem’s wiki and which was really helpful.

Picture by kaveman743 adapted from the original on flickr, used under CC BY-NC 2.0 license.