Posted December 23rd, 2007 by admin
Every day, hundreds of web robots and search engine crawlers set out on a huge task: visiting billions of pages on the internet. Some, like Google’s bot, index our pages along with the rest of the web; others are bad robots, such as spam bots that hunt down every email address they can find in order to steal it.
For the most part, we love it when Google pays us a visit to index our content! Controlling what Google and others pick up, however, means taking an extra step to direct them only to the content we want indexed. Sometimes there are areas of our site that we don’t want others to see, like a temp folder. To save bandwidth, we may want to keep images, stylesheets or other files from being indexed too. Confidential files, like a database of contacts’ names and addresses, are best kept offline or on another machine rather than risking their spread on the net.
This is where the Robots Exclusion Protocol (REP) comes in. Think of it as a sign on an office door that says “restricted area”: access is for employees only, and the sign is meant to turn away unwanted visitors. /robots.txt works just like that. There is another form of REP, placed in a META tag, that works the same way; we will discuss it in our next post. For now, we’ll talk about /robots.txt.
What is /robots.txt?
/robots.txt is a simple text file. It’s not an HTML page, just a basic text file that can do wonders! It tells robots which pages we would NOT want them to visit. Robots are not required to obey it, but good robots and crawlers are generally courteous enough to comply. It is important to note, as in the restricted-area comparison above, that it’s just a sign on an unlocked door. It doesn’t mean an unwanted visitor can’t get in if he wants to! Bad robots like spam bots and malware bots may still walk through the door to look for loopholes in your security and for those email addresses, but the good bots will abide by the sign and will not barge in uninvited.
As mentioned earlier, it is risky to place sensitive files on your site and hope that robots.txt will keep them from being indexed and appearing in search results. The /robots.txt file is also public: anyone can open it and see exactly which sections you don’t want robots to visit, so you don’t want a name like /mybankaccounts listed in it. The file only tells visitors that they shouldn’t view the mybankaccounts folder; anyone who knows a way into it can still get in!
What does robots.txt look like and how does it work?
The concept of robots.txt is this: a robot wants to visit the page http://www.myownsite.com/welcome.html. Before it does anything else, it first looks for http://www.myownsite.com/robots.txt to find out which pages it may and may not index. If it can’t find the file, it goes ahead and indexes everything on the site.
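The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular crawler is written; the function name robots_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt address a polite robot would check for a page.

    Only the scheme and host of the page are kept; the path is always
    /robots.txt at the root of the site.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.myownsite.com/welcome.html"))
print(robots_url("http://www.myownsite.com/blog/2007/post.html"))
```

Both calls give the same address, because the file is looked up once per site at the root, no matter how deep the page being visited is.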
The basic structure of a robots.txt file is just two lines:
User-agent: *
Disallow: /
Here * (asterisk) means ALL robots and / (slash) means that no page should be indexed. Taken together, the file says: ALL robots are NOT allowed to index any of the pages. We don’t want that on a real site, but it shows the basic component: Disallow: followed by / and then the file or folder name.
On the other hand, leaving the Disallow value empty allows all robots to index all pages:
User-agent: *
Disallow:
This is effectively the default for every website until we manually create a robots.txt that lists the files we don’t want indexed.
If you don’t have much on your site yet, it is best to use the allow-all form, create a robots.txt file and leave it empty, or simply not create one at all. /robots.txt is for those who have files in their directories that they don’t want indexed, files they don’t want to see appear in searches.
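To see the difference between these basic forms, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name allowed, the robot name anybot and the page URL are assumptions made for illustration, and real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

PAGE = "http://www.myownsite.com/welcome.html"

BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # nobody may index anything
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # empty Disallow: everything is open
EMPTY_FILE = ""                              # an empty robots.txt

for name, text in [("block all", BLOCK_ALL),
                   ("allow all", ALLOW_ALL),
                   ("empty file", EMPTY_FILE)]:
    print(name, "->", allowed(text, "anybot", PAGE))
```

Under this parser, only the first form turns the robot away; the empty Disallow line and the empty file both leave the whole site open.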
There’s really no point in having folders like images or cgi-bin indexed, and skipping them saves bandwidth. To allow all robots to index your pages except the folders you list, add one Disallow line per folder, for example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
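A quick way to check which pages such a file keeps robots away from is to run it through Python’s standard urllib.robotparser. This is a sketch under assumptions: the folder names, the sample paths and the helper blocked_paths are invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: keep every robot out of two folders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def blocked_paths(paths, agent="anybot"):
    """Return the paths from `paths` that `agent` is asked not to fetch."""
    base = "http://www.myownsite.com"
    return [p for p in paths if not parser.can_fetch(agent, base + p)]

paths = [
    "/welcome.html",
    "/images/logo.png",
    "/cgi-bin/form.pl",
    "/imagesgallery.html",  # not inside /images/, so it stays open
]
print(blocked_paths(paths))
```

The rules match by path prefix, which is why the trailing slash matters: /images/ covers the folder and everything in it, but not a page whose name merely starts with the same letters.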
If you want to be specific and allow only Google to crawl your site, give Googlebot its own record and shut every other robot out:
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

User-agent: *
Disallow: /
It means you’re allowing Google to index your pages except the cgi-bin and privatedir folders. Note, however, that if you do this, MSN, Yahoo and Alexa will not be allowed to index your site, so you might want to reconsider.
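The Google-only setup can be checked the same way with urllib.robotparser. The rules below are a reconstruction from the description above, and msnbot is used here only as an illustrative name for some other crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: Googlebot may index everything except two folders,
# and every other robot is turned away from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.myownsite.com"

def may_fetch(agent: str, path: str) -> bool:
    """Return True if `agent` is allowed to fetch `path` on the site."""
    return parser.can_fetch(agent, BASE + path)

print(may_fetch("Googlebot", "/welcome.html"))         # allowed
print(may_fetch("Googlebot", "/privatedir/notes.txt"))  # excluded folder
print(may_fetch("msnbot", "/welcome.html"))             # other robots are shut out
```

The blank line between the two records matters: it is what separates the rules for Googlebot from the rules for everyone else.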
Use the samples above exactly as written. Be careful with spelling, colons and placement: for example, don’t write Disalow instead of Disallow, or User Agent instead of User-agent. Also, the filename is robots.txt, not Robots.Txt.
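Mistakes like these are easy to catch before uploading the file. Below is a small, hypothetical checker in Python; the function find_suspect_lines and its list of known field names are assumptions for this sketch (Allow and Sitemap are common extensions, not part of the two basic fields discussed in this post), and it is not a full validator.

```python
# Field names this sketch accepts; anything else is reported.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}

def find_suspect_lines(text: str):
    """Return (line_number, problem) pairs for lines that look mistyped."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines only separate records
        field, colon, _value = line.partition(":")
        if not colon:
            problems.append((number, "missing colon"))
        elif field.strip().lower() not in KNOWN_FIELDS:
            problems.append((number, "unknown field " + repr(field.strip())))
    return problems

SAMPLE = """\
User Agent: *
Disalow: /tmp/
Disallow /images/
User-agent: *
Disallow: /cgi-bin/
"""

for number, problem in find_suspect_lines(SAMPLE):
    print("line", number, "->", problem)
```

Here the first three lines of SAMPLE are flagged (a space instead of a hyphen, a misspelling and a missing colon), while the last two pass.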
Where to place the robots.txt file?
Placement of the robots.txt file is very important: if it is in the wrong place, robots and search crawlers won’t find it and will most likely index ALL your pages. They don’t have all day to search through our files for it. The only place to put it is the root directory of the site, not in folders or sub-directories. To check for your robots.txt file, just type this into your browser’s address bar: http://myownsite.com/robots.txt
We have shared only the basics of robots.txt here. To learn more, visit robotstxt.org. That site will also bring you to the Robots Database, a list of over 300 identified robots wandering the internet every day.