Star Crusaders

The Robots.txt File

   

Since the beginning of the Internet there has been a need to index the Web, and many robots have been built for this purpose. You probably already know the famous Google bot, which crawls the Web to keep track of URLs and build a picture of how pages link together (the link popularity algorithm...).

There are only a few ways to scan a website, and some pages of a website may not need to be crawled at all, for reasons such as privacy.

A Standard for Robot Exclusion has been created, and robots from search engines and other sources now look for the robots.txt file before starting to scan a website. This file tells the robots which links may be scanned and which links should not be indexed.

A good resource about the robots.txt file is at this address:

http://www.robotstxt.org

The site publishes information about Web robots. You may find it interesting if you plan to create your own bot or want to learn more about their history.

Practice
You may have noticed requests for robots.txt from robots in your server logs. Each request produces a "file does not exist" error when you don't have a robots.txt file. If you just want to clear these errors from the log files, create the robots.txt file, even if you leave it empty.
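To see how often this happens on your own server, a short script can pick the failed robots.txt requests out of an access log. This is only a sketch: it assumes the log uses the Common Log Format, and the sample lines (addresses, dates, sizes) are made up for illustration.

```python
import re

# Matches the request and status fields of a Common Log Format line.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def missing_robots_hits(log_lines):
    """Return the log lines where a request for /robots.txt got a 404."""
    hits = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match["path"] == "/robots.txt" and match["status"] == "404":
            hits.append(line)
    return hits


# Made-up sample lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Oct/2006:13:55:36 +0000] "GET /robots.txt HTTP/1.0" 404 209',
    '192.0.2.1 - - [10/Oct/2006:13:55:37 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2006:14:02:11 +0000] "GET /robots.txt HTTP/1.1" 404 209',
]

print(len(missing_robots_hits(sample_log)))
```

Once an empty robots.txt file exists, the same requests should be logged with a 200 status and no longer show up in this list.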

The structure of this file is pretty simple: you can disallow specific agents, and you can disallow parts of your website or only a few pages. You can also deny everything or allow everything.
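The structure can be illustrated with a small helper that writes the file from a table of rules. This is a minimal sketch, not an official tool: the function name `build_robots_txt` and the agent name `examplebot` are hypothetical, and only the User-agent and Disallow fields discussed in this article are produced.

```python
def build_robots_txt(rules):
    """Render {user_agent: [disallowed path prefixes]} as robots.txt text.

    An agent with an empty list gets an empty "Disallow:" line,
    which means nothing is disallowed for that agent.
    """
    records = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        for path in paths or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        records.append("\n".join(lines))
    # Records are separated from each other by a blank line.
    return "\n\n".join(records) + "\n"


allow_everything = build_robots_txt({"*": []})
deny_everything = build_robots_txt({"*": ["/"]})
# Hypothetical agent name, used only to show a per-agent rule.
block_one_agent = build_robots_txt({"examplebot": ["/"], "*": []})

print(allow_everything)
print(deny_everything)
print(block_one_agent)
```

Keeping the rules in a table like this makes it easy to see at a glance which agents are restricted and where.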

Here is an example from http://www.robotstxt.org:

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first paragraph indicates that the robot called 'webcrawler' has nothing disallowed: it can go anywhere. The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that '*' is a special token meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
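The reading of the example given above can be checked with Python's standard `urllib.robotparser` module. The sketch below feeds the same three records to the parser and asks what each robot may fetch; the agent name `somebot` and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())


def allowed(agent, path):
    """Ask the parsed rules whether `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


# "somebot" is a hypothetical robot that only matches the '*' record.
for agent in ("webcrawler", "lycra", "somebot"):
    for path in ("/index.html", "/tmp/cache.html", "/logs/today.txt"):
        print(agent, path, allowed(agent, path))
```

Disallow values are matched as plain prefixes, so a rule such as /tmp also covers a path like /tmpfile.html.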

Validator
Once you are done with your robots.txt file, you should test it with a robots.txt validator. There is one at this address: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi. The Searchengineworld website also provides a more complete tutorial about the robots.txt file.
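For a quick local check before using an online validator, a few lines of Python can catch the most common mistakes. This is a rough sketch and not a replacement for a real validator: it assumes only the User-agent and Disallow fields described in this article are valid, and the function name `lint_robots_txt` is made up.

```python
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_agent = False  # True once the current record has a User-agent line
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            if not raw.strip():
                seen_agent = False  # a truly blank line ends the record
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif not seen_agent:
            problems.append((number, "Disallow before any User-agent"))
    return problems


good = "User-agent: *\nDisallow: /tmp\nDisallow: /logs\n"
bad = "Disallow: /tmp\nUser-agnet: *\nDisallow /logs\n"

print(lint_robots_txt(good))
for number, message in lint_robots_txt(bad):
    print(number, message)
```

The check is deliberately strict. Some robots accept extra fields, so treat an "unknown field" message as something to look up in that robot's documentation rather than as a definite error.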

Notes
- Using this file may reduce the bandwidth robots consume on your server, if you disallow a few pages.
- It also cleans up your log files a little (one error line fewer per bot scan).
- The most important point is that this file is recommended by bots for duplicate websites. As you may be penalized for having duplicate sites, a solution is to deny access to one of the sites.

Specific
Each bot may act a little differently, so it's advisable to check the FAQ of each bot to learn more about its indexing behaviour. For instance, for the Yahoo Slurp bot, you can check this URL: Yahoo Slurp Index

The MSN bot: http://search.msn.com/docs/siteowner.aspx
The Google bot: http://www.google.com/bot.html

Here is a database of web robots: http://www.robotstxt.org/wc/active/html/contact.html

Warning
Some hackers search robots.txt files for the directories and files that should not be scanned. Robots that behave this way are also called 'bad robots'.

The solution is either not to mention the links and directories you want avoided, or to put them in a special place protected by additional server security.
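To see why this matters, note how little effort it takes to turn a robots.txt file into a list of places worth probing. The sketch below is only an illustration of what any visitor can read; the function name `exposed_paths` and the sample paths are made up.

```python
def exposed_paths(robots_text):
    """Return every non-empty Disallow value, in file order."""
    paths = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file that gives away more than it should.
sample = """\
User-agent: *
Disallow: /private-admin
Disallow: /backups
"""

print(exposed_paths(sample))
```

Running the same function on your own file before publishing it is a simple way to review what you are announcing to everyone, well-behaved robot or not.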

Future of web agents
The job of web agents is becoming more and more complex as the web grows. Technology is improving and connections keep getting faster and cheaper, yet the cables keep getting busier.

Heaps of websites go online every day, and web agents must index them all in a relevant way. Do you remember the time when Google would crawl a new website within a day?

I guess not!

I would not be surprised if web agents start to apply some kind of selection and automatically avoid websites whose HTML is not valid. My advice: follow the rules, test your site, and make it conform to today's search engine guidelines.

The robots.txt file may help those agents understand your site, so use it; it will reward you later.

Thanks for reading. I hope this article has been useful for some of you.

Author: Etienne Peysson
 
Author Bio:
Etienne Peysson is a champion in this field. Etienne has written several articles in the past on this topic.
 
 
 

Copyright © 2006-2008 www.starcrusaders.com - All Rights Reserved.