Sitemaps, Webmaster Tools, Rails, and You
Sitemaps are an easy way to tell search engines which pages on your site are available for crawling. Web crawlers typically discover pages by following links from site to site. Sitemaps give them hints so they can do a better job.
The Sitemap protocol defines a simple XML schema for specifying what URLs are on your site, along with some useful optional information: the date of last modification, how frequently the page is likely to change, and the priority of a given URL relative to other URLs on your site. Here's an example:
/sitemap.xml - Sample sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://sickpea.com/</loc>
    <lastmod>2009-07-17T17:17:32-07:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Here's the bit of code I use to generate a sitemap for this site:
app/controllers/sitemap_controller.rb - Controller code
class SitemapController < ApplicationController
  def sitemap
    @articles = Article.published
    respond_to { |format| format.xml }
  end
end
app/views/sitemap/sitemap.xml.builder - View/builder template
# Sitemaps 0.9 XML format: http://www.sitemaps.org/protocol.php
xml.instruct!
xml.urlset :xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9' do
  xml.url do
    xml.loc root_url
    xml.lastmod @articles.first.updated_at.iso8601
    xml.changefreq 'daily'
    xml.priority '0.8'
  end
  @articles.each do |a|
    xml.url do
      xml.loc published_article_raw_url(a.published_at.year, a.published_at.month, a.slug)
      xml.lastmod a.updated_at.iso8601
    end
  end
end
config/routes.rb - Routes file
ActionController::Routing::Routes.draw do |map|
  ...
  map.sitemap '/sitemap.xml', :controller => 'sitemap', :action => 'sitemap'
  ...
end
This sitemap is generated on demand, which won't scale once the number of articles gets large. If you've got many URLs, you'll want to pre-generate it and gzip it (e.g., sitemap.xml.gz). Here's an interesting sitemap generator plugin I found on GitHub that you could use for this.
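As a rough illustration of the pre-generate-and-gzip idea, here is a minimal sketch in plain Ruby using only the standard library. The helper names `build_sitemap_xml` and `write_gzipped_sitemap` and the hash-based entry format are assumptions made for this example, not part of the plugin mentioned above or of the site's actual code.

```ruby
require 'cgi'
require 'time'
require 'zlib'

# Hypothetical helper: builds sitemap XML from an array of hashes such as
# { :loc => 'http://example.com/', :lastmod => Time.now }.
# :lastmod is optional, matching the protocol.
def build_sitemap_xml(entries)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  entries.each do |entry|
    lines << '  <url>'
    # Escape &, <, > and quotes so the URL is valid inside XML.
    lines << "    <loc>#{CGI.escapeHTML(entry[:loc])}</loc>"
    lines << "    <lastmod>#{entry[:lastmod].iso8601}</lastmod>" if entry[:lastmod]
    lines << '  </url>'
  end
  lines << '</urlset>'
  lines.join("\n") + "\n"
end

# Hypothetical helper: writes the gzipped sitemap to +path+
# (for example public/sitemap.xml.gz).
def write_gzipped_sitemap(path, entries)
  File.binwrite(path, Zlib.gzip(build_sitemap_xml(entries)))
end
```

Something like this could be called from a rake task or cron job, with the entries built from `Article.published`, so that the web server hands out a static file instead of running the controller on every request.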
There are a few ways you can get your sitemap to the search engines. One way is to use the webmaster tools offered by the major search engines. These tools give stats on when the sitemap was retrieved and how many URLs it contains.
Another way is to reference the sitemap in your robots.txt file:
public/robots.txt
User-agent: *
Disallow:
Sitemap: http://sickpea.com/sitemap.xml
Finally, you can programmatically ping the search engines when you have an update. Replace XXX with an absolute URL to your sitemap (properly URL-encoded, of course).
- http://www.google.com/webmasters/tools/ping?sitemap=XXX
- http://www.bing.com/webmaster/ping.aspx?siteMap=XXX
- http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=XXX
- http://submissions.ask.com/ping?sitemap=XXX
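Filling in the XXX placeholder might look something like the sketch below. The constant `PING_TEMPLATES` and the methods `sitemap_ping_urls` and `ping_search_engines` are names invented for this example. The endpoints are copied from the list above, and the sketch does not check whether each one still accepts pings.

```ruby
require 'cgi'
require 'net/http'
require 'uri'

# Ping endpoints copied from the list above; XXX marks where the
# URL-encoded sitemap address goes.
PING_TEMPLATES = [
  'http://www.google.com/webmasters/tools/ping?sitemap=XXX',
  'http://www.bing.com/webmaster/ping.aspx?siteMap=XXX',
  'http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=XXX',
  'http://submissions.ask.com/ping?sitemap=XXX'
].freeze

# Hypothetical helper: returns one ping URL per search engine, with the
# sitemap URL encoded for use as a query-string value.
def sitemap_ping_urls(sitemap_url)
  encoded = CGI.escape(sitemap_url)
  PING_TEMPLATES.map { |template| template.sub('XXX', encoded) }
end

# Hypothetical helper: sends a GET request to each ping URL and returns
# a hash of URL => HTTP status code. Makes real network requests.
def ping_search_engines(sitemap_url)
  sitemap_ping_urls(sitemap_url).each_with_object({}) do |url, results|
    results[url] = Net::HTTP.get_response(URI(url)).code
  end
end
```

For example, `sitemap_ping_urls('http://sickpea.com/sitemap.xml').first` gives `http://www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fsickpea.com%2Fsitemap.xml`. In a Rails app, `ping_search_engines` could be triggered after an article is published, ideally outside the request cycle so a slow endpoint doesn't hold up the page.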
There are some limitations built into the protocol. You may not list more than 50,000 URLs in a single sitemap file, and the file can't exceed 10MB uncompressed (you probably won't hit the size limit unless you have long URLs or specify all the optional parameters for each URL). When you exceed either limit, you should create multiple sitemap files and list them in a sitemap index file. An index file, in turn, has the same 50,000-entry limit, but I suspect you've got bigger concerns if you have more than 2.5 billion unique URLs!
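To make the splitting step concrete, here is a minimal sketch that chunks a list of URLs by the 50,000-entry limit and builds the matching sitemap index. The helper names and the `sitemapN.xml.gz` naming scheme are assumptions for illustration, and the sketch only enforces the URL count, not the file size limit.

```ruby
require 'cgi'

MAX_URLS_PER_SITEMAP = 50_000

# Hypothetical helper: splits a list of URLs into chunks that each fit
# in a single sitemap file.
def split_into_sitemaps(urls, limit = MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Hypothetical helper: names the sitemap files sitemap1.xml.gz,
# sitemap2.xml.gz, ... under the given base URL (an assumed convention).
def sitemap_file_urls(base_url, chunk_count)
  (1..chunk_count).map { |n| "#{base_url}/sitemap#{n}.xml.gz" }
end

# Builds the sitemap index XML that lists each individual sitemap file.
def build_sitemap_index_xml(sitemap_urls)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  sitemap_urls.each do |url|
    lines << '  <sitemap>'
    lines << "    <loc>#{CGI.escapeHTML(url)}</loc>"
    lines << '  </sitemap>'
  end
  lines << '</sitemapindex>'
  lines.join("\n") + "\n"
end
```

With 120,001 URLs, `split_into_sitemaps` returns three chunks (50,000, 50,000, and 20,001 entries), so the index would list three files. The 2.5 billion figure is simply 50,000 index entries times 50,000 URLs per sitemap.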