Back before there was Google, the big new search engine was AltaVista. To show off its power, the AltaVista team at Digital decided to crawl and index the entire web, which was a new concept at the time. Many site owners didn’t like the idea of a “robot” program accessing every page on their web sites, because it would put more load on their web servers and increase their bandwidth costs. To address these growing concerns, the Robots Exclusion Standard was created in 1996.
You can use a simple text file called robots.txt to keep search engines out of a directory. Here is a very simple example that will prevent all search engines (user-agents) from accessing the /images directory.
    User-agent: *
    Disallow: /images
When you block the /images directory, you also block all of its subdirectories. In fact, the rule matches any path that begins with /images, so the directory /images/logos and the file /images.html will both be disallowed.
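The prefix behavior can be checked with Python's standard urllib.robotparser module. This is only a minimal sketch: the crawler name ExampleBot and the test paths are made up for illustration, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text, one directive per line.
rules = [
    "User-agent: *",
    "Disallow: /images",
]

parser = RobotFileParser()
parser.parse(rules)

def allowed(path):
    """Return True if a generic crawler may fetch the given path."""
    # "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
    return parser.can_fetch("ExampleBot", path)

for path in ["/images/logos/logo.png", "/images.html", "/index.html"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Only /index.html is reported as allowed here, because the other two paths start with /images.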
Strangely enough, the first draft of this standard did not contain an “Allow” directive. It was added later, but there is no guarantee that every search engine supports it. This means that anything not specifically disallowed should be treated as fair game for web crawlers.
To disallow access to your entire web site use a robots.txt like this:
    User-agent: *
    Disallow: /
When the User-agent is *, the lines that follow it apply to every search robot. By naming the signature of a particular web crawler in the User-agent line instead, you can give instructions to that robot alone.
    User-agent: Googlebot
    Disallow: /google-secrets
Since the original spec was published, several search engines have extended the protocol. One popular extension is to allow wildcards.
    User-agent: Slurp
    Disallow: /*.gif$
As a result, Yahoo!’s web crawler (named Slurp) will not fetch files on your site whose URLs end in .gif. You do need to preface these lines with the matching user-agent line, since not every search engine presently supports wildcard matches.
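To make the wildcard syntax concrete, here is a rough Python sketch of how such a pattern could be matched against a path. It assumes the commonly documented meaning of the extension (* matches any run of characters, and a trailing $ anchors the end of the URL). The function names are invented for this example and are not taken from any crawler's actual code.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" where a "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # A trailing "$" in the pattern means the path must end here.
    return re.compile(body + (r"\Z" if anchored else ""))

def is_disallowed(pattern, path):
    """Return True if the path matches the Disallow pattern."""
    # re.match anchors at the start of the path, like ordinary prefix matching.
    return pattern_to_regex(pattern).match(path) is not None

for path in ["/images/logo.gif", "/images/logo.gif?size=2", "/images/logo.png"]:
    print(path, is_disallowed("/*.gif$", path))
```

Under these assumptions only /images/logo.gif matches /*.gif$, since the other two paths do not end in .gif. A pattern without wildcards, such as /images, still behaves as a plain prefix match.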
You can combine several of these techniques in one robots.txt file. Here is an example.
    User-agent: *
    Disallow: /bar

    User-agent: Googlebot
    Allow: /foo
    Disallow: /bar
    Disallow: /*.gif$
    Disallow: /
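To see how a parser might read a combined file like this, the sketch below feeds it to Python's standard urllib.robotparser. Two caveats: the wildcard line is left out because this module does plain prefix matching and does not interpret * or $, and it applies rules in file order, whereas some crawlers pick the most specific matching rule. The name OtherBot and the test paths are made up.

```python
from urllib.robotparser import RobotFileParser

# The combined example, minus the wildcard rule (not understood by this parser).
ROBOTS_TXT = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(agent, path):
    """Return True if the named crawler may fetch the path."""
    return parser.can_fetch(agent, path)

# Googlebot uses its own group; "OtherBot" (hypothetical) falls back to "*".
for agent in ["Googlebot", "OtherBot"]:
    for path in ["/foo/page.html", "/bar", "/about.html"]:
        print(agent, path, may_fetch(agent, path))
```

In this reading Googlebot may fetch only what lies under /foo, while every other robot is kept out of /bar and nothing else.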
Computer programs are great at following well-defined instructions. The human brain, however, is less efficient at this kind of task, so the best advice is to keep things simple.
For us mortals there is a robots.txt analysis tool in Google’s webmaster tools. Highly recommended. Another good resource for more information on the Robots Exclusion Standard is www.robotstxt.org.
Many corporations are prepared to spend a large sum on getting their pages into search engine listings, so leaving yourself out of the mix might seem backward. However, there are sound security reasons for limiting how much of your site a search engine can index.
A good resource to check out is the Digital Security Report.