Sickpea

Sitemaps are an easy way to tell search engines which pages on your site are available for crawling. Web crawlers typically discover pages by following links from site to site. Sitemaps provide hints so they can do a better job.

The Sitemap protocol defines a simple XML schema for specifying what URLs are on your site, along with some useful optional information: the date of last modification, how frequently the page is likely to change, and the priority of a given URL relative to other URLs on your site. Here's an example:

/sitemap.xml - Sample sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://sickpea.com/</loc>
    <lastmod>2009-07-17T17:17:32-07:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Here's the bit of code I use to generate a sitemap for this site:

app/controllers/sitemap_controller.rb - Controller code

class SitemapController < ApplicationController

  def sitemap
    @articles = Article.published
    respond_to { |format| format.xml }
  end

end

app/views/sitemap/sitemap.xml.builder - View/builder template

# Sitemaps 0.9 XML format: http://www.sitemaps.org/protocol.php
xml.instruct!
xml.urlset :xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9' do
  xml.url do
    xml.loc root_url
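    # Skip lastmod when nothing is published yet (avoids calling updated_at on nil).
    xml.lastmod @articles.first.updated_at.iso8601 unless @articles.empty?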
    xml.changefreq 'daily'
    xml.priority '0.8'
  end
  @articles.each do |a|
    xml.url do
      xml.loc published_article_raw_url(a.published_at.year, a.published_at.month, a.slug)
      xml.lastmod a.updated_at.iso8601
    end
  end
end

config/routes.rb - Routes file

ActionController::Routing::Routes.draw do |map|
  ...
  map.sitemap '/sitemap.xml', :controller => 'sitemap', :action => 'sitemap'
  ...
end

This sitemap is generated on demand, which won't scale once the number of articles gets large. If you've got many URLs, you'll want to pre-generate and gzip it (e.g., sitemap.xml.gz). Here's an interesting sitemap generator plugin I found on GitHub that you could use for this.
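If you'd rather roll the pre-generation step yourself, here's a minimal sketch using Ruby's standard Zlib library. The method name write_gzipped_sitemap and its arguments are my own placeholders, not part of Rails or the plugin; it assumes you already have the sitemap XML as a string.

```ruby
require 'zlib'

# Hypothetical helper: writes the sitemap XML (a String) to a
# gzip-compressed file at `path` and returns the path.
def write_gzipped_sitemap(xml, path)
  Zlib::GzipWriter.open(path) do |gz|
    gz.write(xml)
  end
  path
end
```

You could call something like this from a rake task or cron job and write the result somewhere under public/, so the web server serves the file without touching Rails.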

There are a few ways you can get your sitemap to the search engines. One way is to use the webmaster tools offered by the major search engines. These tools give stats on when the sitemap was retrieved and how many URLs it contains.

Another way is to reference the sitemap in your robots.txt file:

public/robots.txt

User-agent: *
Disallow:
Sitemap: http://sickpea.com/sitemap.xml

Finally, you can programmatically ping the search engines when you have an update. Replace XXX with the absolute URL of your sitemap (properly URL-encoded, of course).

  • http://www.google.com/webmasters/tools/ping?sitemap=XXX
  • http://www.bing.com/webmaster/ping.aspx?siteMap=XXX
  • http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=XXX
  • http://submissions.ask.com/ping?sitemap=XXX
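Building those ping URLs is mostly a matter of encoding the sitemap address correctly. Here's a rough sketch; the constant PING_ENDPOINTS and the method sitemap_ping_urls are names I made up for illustration, and the endpoints are simply copied from the list above (search engines change or retire these over time, so check their current docs before relying on them).

```ruby
require 'uri'

# Endpoints copied from the list above, minus the XXX placeholder.
PING_ENDPOINTS = [
  'http://www.google.com/webmasters/tools/ping?sitemap=',
  'http://www.bing.com/webmaster/ping.aspx?siteMap=',
  'http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=',
  'http://submissions.ask.com/ping?sitemap='
].freeze

# Hypothetical helper: returns one ping URL per endpoint, with the
# sitemap URL percent-encoded so it is safe as a query parameter.
def sitemap_ping_urls(sitemap_url)
  encoded = URI.encode_www_form_component(sitemap_url)
  PING_ENDPOINTS.map { |endpoint| endpoint + encoded }
end
```

To actually send the pings you could fetch each URL with Net::HTTP.get(URI(url)), ideally in a background job so a slow search engine doesn't hold up your publishing flow.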

There are some limitations built into the protocol. A single sitemap file may not list more than 50,000 URLs, and it can't exceed 10MB uncompressed (you probably won't hit the size limit unless you have long URLs and/or specify all the optional parameters per URL). When you exceed either limit, you should create multiple sitemap files and list them in a sitemap index file. An index file, in turn, also has the 50,000-entry limit, but I suspect you've got bigger concerns if you have more than 2.5 billion unique URLs!
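Here's a sketch of how the splitting and the index file might look in plain Ruby. The helper names (xml_escape, sitemap_chunks, sitemap_index_xml) are my own, and it only handles the 50,000-URL limit, not the 10MB size limit; the sitemapindex, sitemap, and loc elements come from the Sitemap protocol.

```ruby
MAX_URLS_PER_SITEMAP = 50_000

XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
  '"' => '&quot;', "'" => '&apos;'
}.freeze

# Escape characters that are not allowed raw inside XML text.
def xml_escape(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# Split a list of URLs into groups that each fit in one sitemap file.
def sitemap_chunks(urls, limit = MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Build a sitemap index with one <sitemap> entry per sitemap file URL.
def sitemap_index_xml(sitemap_urls)
  entries = sitemap_urls.map do |url|
    "  <sitemap>\n    <loc>#{xml_escape(url)}</loc>\n  </sitemap>\n"
  end
  %(<?xml version="1.0" encoding="UTF-8"?>\n) +
    %(<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n) +
    entries.join +
    "</sitemapindex>\n"
end
```

Each chunk would be written out as its own sitemap file (sitemap1.xml.gz, sitemap2.xml.gz, and so on, naming is up to you), and the index would list the absolute URLs of those files.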

Hi, I'm Adrian (@sickp).

I like to build things: websites, games, robots, and mobile apps. I'm a software tinkerer and an MIT-approved engineer (i.e. they can ask me for money.)

During the day I help build fine games at Wonderhill, and lend my expertise to other Ooga Labs companies. In my spare time, I create useful iPhone apps at Zooble with my wife, Alexandra.

© 1988-2010 Adrian B. Danieli. Some rights reserved.