Improving on Robots Exclusion Protocol

It's been a while since we published this blog post. Some of the information may be outdated (for example, some images may be missing, and some links may not work anymore). Read the up-to-date documentation about robots.txt.

Tuesday, June 03, 2008

Web publishers often ask us how they can maximize their visibility on the web. Much of this has to do with search engine optimization—making sure a publisher's content shows up on all the search engines.

However, there are some cases in which publishers need to communicate more information to search engines, such as the fact that they don't want certain content to appear in search results. For that they use something called the Robots Exclusion Protocol (REP), which lets publishers control how search engines access their site: whether it's controlling the visibility of their content across their site (via robots.txt) or at a much more granular level for individual pages (via meta tags).

Since it was introduced in the early '90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private. Today, millions of publishers use REP as an easy and efficient way to communicate with search engines. Its strength lies in its flexibility to evolve in parallel with the web, in its universal implementation across major search engines and all major robots, and in the way it works for any publisher, no matter how large or small.

While REP is observed by virtually all search engines, we've never come together to detail how we each interpret different tags. Over the last couple of years, we have worked with Microsoft and Yahoo! to bring forward standards such as Sitemaps and to offer additional tools for webmasters. Since the original announcement, we have delivered, and will continue to deliver, further improvements based on what we are hearing from the community.

Today, in that same spirit of making the lives of webmasters simpler, we're releasing detailed documentation about how we implement REP. This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP rules will be handled by three major search providers—making REP more intuitive and friendly to even more publishers on the web.

So, without further ado...

Common REP rules

The following are all the major REP features currently implemented by Google, Microsoft, and Yahoo!. For each feature, you'll see what it does and how you should communicate it.

Each of these rules can be applied to all crawlers or to specific crawlers by targeting them to specific user-agents, which is how each crawler identifies itself. Apart from identification by user-agent, each of our crawlers also supports reverse DNS-based authentication to let you verify the identity of the crawler.
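
For example, a robots.txt file can scope its rules to a particular crawler by naming that crawler's user-agent, with a catch-all group for everyone else (the paths here are hypothetical):

    # Rules for Google's crawler only
    User-agent: Googlebot
    Disallow: /archive/

    # Rules for all other crawlers
    User-agent: *
    Disallow: /tmp/

Reverse DNS authentication happens outside robots.txt: you look up the hostname for a visiting crawler's IP address and confirm it resolves back to the search engine's domain before trusting the user-agent string.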

Robots.txt rules

For each rule below, the impact describes what it does, followed by typical use cases.

Disallow
Impact: Tells a crawler not to crawl your site or parts of it. Your site's robots.txt file still needs to be fetched so the crawler can find this rule, but disallowed pages will not be crawled.
Use cases: 'No Crawl' pages on a site. In the default syntax, this rule prevents specific path(s) of a site from being crawled.

Allow
Impact: Tells a crawler the specific pages on your site you want crawled, so you can use this in combination with Disallow.
Use cases: Particularly useful in conjunction with Disallow rules, where a large section of a site is disallowed except for a small section within it.

$ Wildcard Support
Impact: Tells a crawler to match everything from the end of a URL, covering a large number of directories without specifying specific pages.
Use cases: 'No Crawl' files with specific patterns, for example, files of a certain filetype that always have a certain extension, say pdf.

* Wildcard Support
Impact: Tells a crawler to match a sequence of characters.
Use cases: 'No Crawl' URLs with certain patterns, for example, disallow URLs with session IDs or other extraneous parameters.

Sitemaps Location
Impact: Tells a crawler where it can find your Sitemaps.
Use cases: Point to other locations where feeds exist to help crawlers find URLs on a site.
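
For illustration, here is a sketch of a robots.txt file that combines all five of these rules (the paths, patterns, and sitemap URL are hypothetical):

    User-agent: *
    # 'No Crawl' an entire section of the site...
    Disallow: /private/
    # ...except for one page within it
    Allow: /private/overview.html
    # * wildcard: block URLs carrying session ids
    Disallow: /*?sessionid=
    # $ wildcard: block every PDF file, wherever it lives
    Disallow: /*.pdf$

    Sitemap: https://www.example.com/sitemap.xml

Because the Allow rule is more specific than the Disallow rule above it, the one page it names stays crawlable even though the rest of its directory is blocked.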

HTML meta rules

noindex meta tag
Impact: Tells a crawler not to index a given page.
Use cases: Don't index the page. This allows pages that are crawled to be kept out of the index.

nofollow meta tag
Impact: Tells a crawler not to follow links from a given page to other content.
Use cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using nofollow, you let the robot know that you are discounting all outgoing links from this page.

nosnippet meta tag
Impact: Tells a crawler not to display snippets in the search results for a given page.
Use cases: Present no snippet for the page in search results.

noarchive meta tag
Impact: Tells a search engine not to show a "cached" link for a given page.
Use cases: Do not make a copy of the page from the search engine cache available to users.

noodp meta tag
Impact: Tells a crawler not to use a title and snippet from the Open Directory Project (ODP) for a given page.
Use cases: Do not use the ODP title and snippet for this page.
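
As a sketch, these rules go in the <head> of a page, using the generic "robots" name to address all crawlers or a specific user-agent name to address just one:

    <head>
      <title>Example page</title>
      <!-- Keep this page out of the index and discount its outgoing links -->
      <meta name="robots" content="noindex, nofollow">
      <!-- Address a single crawler instead of all of them -->
      <meta name="googlebot" content="nosnippet, noarchive">
    </head>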

These rules are applicable to all forms of content. They can be placed either in the HTML of a page or, for non-HTML content such as PDF or video, in the HTTP header using an X-Robots-Tag. You can read more about it in the X-Robots-Tag post or in our series of posts about using robots and meta tags.
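
For non-HTML content, the same rule names travel in the response headers. A hypothetical sketch of a server response for a PDF:

    HTTP/1.1 200 OK
    Content-Type: application/pdf
    X-Robots-Tag: noindex, noarchive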

Other REP rules

The rules listed above are used by Microsoft, Google, and Yahoo!, but may not be implemented by all other search engines. In addition, the following rules are supported by Google but, unlike those above, are not supported by all three:

unavailable_after meta tag - Tells a crawler when a page should "expire", that is, after which date it should not show up in search results.

noimageindex meta tag - Tells a crawler not to index images for a given page in search results.

notranslate meta tag - Tells a crawler not to translate the content on a page into different languages for search results.
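
As a sketch, these Google-specific rules take the same meta tag form as the common ones. The date format shown for unavailable_after is an assumption based on the RFC 850 style Google has documented elsewhere, so treat the exact value as illustrative:

    <!-- Stop showing this page in results after the given date (illustrative format) -->
    <meta name="googlebot" content="unavailable_after: 25-Aug-2008 15:00:00 EST">
    <!-- Don't index this page's images and don't offer translation -->
    <meta name="googlebot" content="noimageindex, notranslate">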

Going forward, we plan to continue to work together to ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. So stay tuned for more!

Learn more

You can find out more about robots.txt in our documentation and at Google's Webmaster help center, which contains lots of helpful information.

We've also written several posts about robots.txt in our webmaster blog that you may find useful.

There is also a useful list of the bots used by the major search engines.

To see what our colleagues have to say, you can also check out the blog posts published by Yahoo! and Microsoft.

Written by Prashanth Koppula, Product Manager
