Sitemap format (XML)

  03/26/07 20:38, by , Categories: Specs

The Sitemap protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.

The Sitemap must:

  • Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  • Specify the namespace (protocol standard) within the <urlset> tag.
  • Include a <url> entry for each URL, as a parent XML tag.
  • Include a <loc> child entry for each <url> parent tag.

All other tags are optional. Support for these optional tags may vary among search engines. Refer to each search engine's documentation for details.

Sample XML Sitemap

The following example shows a Sitemap that contains just one URL and uses all optional tags. The optional tags are in italics.

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
Leave a comment »

Sitemap tags

  04/08/07 00:28, by , Categories: Specs

<urlset>

required

Encapsulates the file and references the current protocol standard.

<url>

required

Parent tag for each URL entry. The remaining tags are children of this tag.

<loc>

required

URL of the page. This URL must begin with the protocol (such as http) and end with a trailing slash, if your web server requires it. This value must be less than 2,048 characters.

<lastmod>

optional

The date of last modification of the file. This date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD.

Note that this tag is separate from the If-Modified-Since (304) header the server can return, and search engines may use the information from both sources differently.

<changefreq>

optional

How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers may consider this information when making decisions, they may crawl pages marked "hourly" less frequently than that, and they may crawl pages marked "yearly" more frequently than that. Crawlers may periodically crawl pages marked "never" so that they can handle unexpected changes to those pages.

<priority>

optional

The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites—it only lets the search engines know which pages you deem most important for the crawlers.

The default priority of a page is 0.5.

Please note that the priority you assign to a page is not likely to influence the position of your URLs in a search engine's result pages. Search engines may use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your most important pages are present in a search index.

Also, please note that assigning a high priority to all of the URLs on your site is not likely to help you. Since the priority is relative, it is only used to select between URLs on your site.

Leave a comment »

Entity escaping

  04/08/07 00:42, by , Categories: Specs

Your Sitemap file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the table below.

Character   Escape Code
Ampersand & &amp;
Single Quote ' &apos;
Double Quote " &quot;
Greater Than > &gt;
Less Than < &lt;

In addition, all URLs (including the URL of your Sitemap) must be URL-escaped and encoded for readability by the web server on which they are located. However, if you are using any sort of script, tool, or log file to generate your URLs (anything except typing them in by hand), this is usually already done for you. Please check to make sure that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.

Below is an example of a URL that uses a non-ASCII character (ü), as well as a character that requires entity escaping (&):

Code

http://www.example.com/ümlat.html&q=name

Below is that same URL, ISO-8859-1 encoded (for hosting on a server that uses that encoding) and URL escaped:

Code

http://www.example.com/%FCmlat.html&q=name

Below is that same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL escaped:

Code

http://www.example.com/%C3%BCmlat.html&q=name

Below is that same URL, but also entity escaped:

Code

http://www.example.com/%C3%BCmlat.html&amp;q=name

Sample XML Sitemap

The following example shows a Sitemap in XML format. The Sitemap in the example contains a small number of URLs, each using a different set of optional parameters.

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>
1 comment »

Sitemap index files

  04/08/07 00:50, by , Categories: Specs

You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes). If you would like, you may compress your Sitemap files using gzip to stay within 10MB and reduce your bandwidth requirement. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.

If you do provide multiple Sitemaps, you should then list each Sitemap file in a Sitemap index file. Sitemap index files may not list more than 1,000 Sitemaps and must be no larger than 10MB (10,485,760 bytes). The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file.

The Sitemap index file must:

  • Begin with an opening <sitemapindex> tag and end with a closing tag.
  • Include a <sitemap> entry for each Sitemap as a parent XML tag.
  • Include a <loc> child entry for each <sitemap> parent tag.

The optional <lastmod> tag is also available for Sitemap index files.

Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemap index file must be UTF-8 encoded.

Sample XML Sitemap Index

The following example shows a Sitemap index that lists two Sitemaps:

XML

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

Note: Sitemap URLs, like all values in your XML files, must be entity escaped.

1 comment »

Sitemap index tags

  04/08/07 00:58, by , Categories: Specs

<sitemapindex>

required

Encapsulates information about all of the Sitemaps in the file.

<sitemap>

required

Encapsulates information about an individual Sitemap.

<loc>

required

Identifies the location of the Sitemap.

This location can be a Sitemap, an Atom file, RSS file or a simple text file.

<lastmod>

optional

Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format.

By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.

Leave a comment »

::