Sickpea

Sitemaps are an easy way to tell search engines which pages on your site are available for crawling. Web crawlers typically discover pages by following links from site to site; a sitemap gives them hints so they can do a better job.

The Sitemap protocol defines a simple XML schema for specifying what URLs are on your site, along with some useful optional information: the date of last modification, how frequently the page is likely to change, and the priority of a given URL relative to other URLs on your site. Here's an example:

/sitemap.xml - Sample sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://sickpea.com/</loc>
    <lastmod>2009-07-17T17:17:32-07:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Here's the bit of code I use to generate a sitemap for this site:

app/controllers/sitemap_controller.rb - Controller code

class SitemapController < ApplicationController

  def sitemap
    @articles = Article.published
    respond_to { |format| format.xml }
  end

end

app/views/sitemap/sitemap.xml.builder - View/builder template

# Sitemaps 0.9 XML format: http://www.sitemaps.org/protocol.php
xml.instruct!
xml.urlset :xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9' do
  xml.url do
    xml.loc root_url
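    # Assumes Article.published returns the newest article first; skip lastmod if there are no articles yet
    xml.lastmod @articles.first.updated_at.iso8601 unless @articles.empty?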
    xml.changefreq 'daily'
    xml.priority '0.8'
  end
  @articles.each do |a|
    xml.url do
      xml.loc published_article_raw_url(a.published_at.year, a.published_at.month, a.slug)
      xml.lastmod a.updated_at.iso8601
    end
  end
end

config/routes.rb - Routes file

ActionController::Routing::Routes.draw do |map|
  ...
  map.sitemap '/sitemap.xml', :controller => 'sitemap', :action => 'sitemap'
  ...
end

This sitemap is generated on demand, which won't scale once the number of articles gets large. If you have many URLs, you'll want to pre-generate the sitemap and gzip it (e.g., sitemap.xml.gz). There is an interesting sitemap generator plugin on GitHub you could use for this.
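As a rough illustration of the pre-generate-and-gzip idea, here is a minimal sketch in plain Ruby using the standard Zlib library. The method name write_gzipped_sitemap and the destination path are assumptions made for this example; they are not part of the site's code, and how you render the XML string beforehand is up to you.

```ruby
require 'zlib'

# Write an already-rendered sitemap XML string to a gzip-compressed file.
# `path` is a hypothetical destination such as public/sitemap.xml.gz.
def write_gzipped_sitemap(xml, path)
  Zlib::GzipWriter.open(path) { |gz| gz.write(xml) }
  path
end

# Example usage (assumes `xml` holds the rendered sitemap):
#   write_gzipped_sitemap(xml, 'public/sitemap.xml.gz')
```

You could call something like this from whatever scheduled job regenerates the sitemap, so the web server hands out a static file instead of hitting the controller on every request.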

There are a few ways to get your sitemap to the search engines. One is to use the webmaster tools offered by the major search engines. These tools report when the sitemap was retrieved and how many URLs it contains.

Another way is to reference the sitemap in your robots.txt file:

public/robots.txt

User-agent: *
Disallow:
Sitemap: http://sickpea.com/sitemap.xml

Finally, you can programmatically ping the search engines when you have an update. Replace XXX with the absolute URL of your sitemap (properly URL-encoded, of course).

  • http://www.google.com/webmasters/tools/ping?sitemap=XXX
  • http://www.bing.com/webmaster/ping.aspx?siteMap=XXX
  • http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=XXX
  • http://submissions.ask.com/ping?sitemap=XXX
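To show what "properly URL-encoded" means in practice, here is a small sketch that builds the ping URLs from the list above. The endpoints are copied verbatim from that list; the constant and method names (PING_ENDPOINTS, sitemap_ping_urls, ping_search_engines) are made up for this example, and the sketch does not check whether each endpoint still responds.

```ruby
require 'uri'
require 'net/http'

# Ping endpoints copied from the list above; the encoded sitemap URL is appended.
PING_ENDPOINTS = [
  'http://www.google.com/webmasters/tools/ping?sitemap=',
  'http://www.bing.com/webmaster/ping.aspx?siteMap=',
  'http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=',
  'http://submissions.ask.com/ping?sitemap='
].freeze

# Build one ping URL per search engine, with the sitemap URL percent-encoded
# so that its own ':' and '/' characters survive as a query parameter value.
def sitemap_ping_urls(sitemap_url)
  encoded = URI.encode_www_form_component(sitemap_url)
  PING_ENDPOINTS.map { |endpoint| endpoint + encoded }
end

# Hypothetical helper: issue a GET request to each ping URL and
# return the HTTP status codes. Not called automatically.
def ping_search_engines(sitemap_url)
  sitemap_ping_urls(sitemap_url).map do |url|
    Net::HTTP.get_response(URI.parse(url)).code
  end
end
```

For http://sickpea.com/sitemap.xml, the encoded value is http%3A%2F%2Fsickpea.com%2Fsitemap.xml.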

There are some limitations built into the protocol. A single sitemap file may not list more than 50,000 URLs or exceed 10MB uncompressed (you probably won't hit the size limit unless you have long URLs and/or specify all the optional parameters for every URL). When you exceed either limit, create multiple sitemap files and list them in a sitemap index file. Index files, in turn, are also limited to 50,000 entries, but I suspect you've got bigger concerns if you have more than 2.5 billion unique URLs!
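Here is a minimal sketch of the splitting step, assuming you already have a flat list of URLs. It groups them into chunks of at most 50,000 and builds a sitemap index document using the protocol's sitemapindex, sitemap, and loc elements. The method names and the sitemap file URLs in the usage note are hypothetical, and the 10MB size check is left out.

```ruby
MAX_URLS_PER_SITEMAP = 50_000

XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&apos;'
}.freeze

# Entity-escape a string for use inside an XML element.
def xml_escape(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# Split a list of URLs into groups that each fit in one sitemap file.
def sitemap_chunks(urls, limit = MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Build a sitemap index document that lists the URL of each sitemap file.
def sitemap_index_xml(sitemap_urls)
  entries = sitemap_urls.map do |url|
    "  <sitemap>\n    <loc>#{xml_escape(url)}</loc>\n  </sitemap>\n"
  end
  %(<?xml version="1.0" encoding="UTF-8"?>\n) +
    %(<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n) +
    entries.join +
    "</sitemapindex>\n"
end
```

In use, you would write each chunk to its own file (say sitemap1.xml.gz, sitemap2.xml.gz, and so on), pass the absolute URLs of those files to sitemap_index_xml, and point the search engines at the index instead of a single sitemap.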


Hi, I'm Adrian.

I'm a software engineer and entrepreneur. I like to build things: websites, games, robots, etc. I am currently Chief Architect at Ooga Labs / Wonderhill.


© 1988-2009 Adrian B. Danieli. Some rights reserved.