
robots.txt: How Google and other Search Engines Find Your Website



Every day, hundreds of web robots and search engine crawlers set out on a huge task: visiting billions of pages on the internet. Some are good robots, like Google’s bot indexing our pages and the rest of the web. Others are bad robots, such as spam bots hunting down every email address they can find in order to steal it.

For the most part, we love it when Google pays us a visit to index our content! Knowing what Google and others are fetching, however, lets us take the extra step of directing them only to the content we want indexed. Sometimes there are areas of our directory that we don’t want others to see, like our temp folder. To save bandwidth, we may want to keep images, stylesheets or other files from being indexed too. As for confidential files on our site, like a database of contacts’ names and addresses, it is best to keep them offline or on another machine rather than risk spreading them on the net.

This is where the Robots Exclusion Protocol (REP) comes in. Think of it as a sign on our office door that says “restricted area”: employees only, meant to turn away unwanted visitors. /robots.txt works just like that. There is another REP mechanism, placed in a META tag, that works in a similar way. We will discuss it in our next post; we’ll talk about /robots.txt first.

What is /robots.txt?

/robots.txt is a simple text file. It’s not HTML, just a basic text file that can do wonders! It tells robots which pages we would NOT want them to visit. Robots are not required to obey it, but good robots and crawlers are generally courteous enough to comply. It is important to note, though, that as in the restricted-area comparison above, it’s just a sign on an unlocked door. It doesn’t mean an unwanted visitor can’t get in if he wants to! Bad robots like spam bots and malware bots may still walk through the door to look for loopholes in your security and for those email addresses, but the good bots will respect the sign and will not barge in uninvited.

As mentioned earlier, it is risky to place sensitive files in your directory and hope that robots.txt will keep them from being indexed and appearing in search results. /robots.txt is also public: anyone can open it and see exactly which sections you don’t want robots to visit, so you don’t want a name like /mybankaccounts listed in it. The file only tells visitors that they shouldn’t view the mybankaccounts folder; anyone who knows a way to get into it still can!

What does robots.txt look like and how does it work?

The concept of robots.txt is this: a robot wants to visit the page http://www.myownsite.com/welcome.html. Before it does anything, it first looks for http://www.myownsite.com/robots.txt to find out which pages it may or may not index. If it can’t find that file, it will go ahead and index everything it can reach on the site.
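
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name robots_txt_url is our own invention, and the sketch assumes the robot simply swaps the page's path for /robots.txt on the same host.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the address a robot would check before fetching page_url.

    Hypothetical helper: keeps the scheme and host of the page and
    replaces the path with /robots.txt at the root of the site.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.myownsite.com/welcome.html"))
```

Whatever page the robot starts from, the file it asks for is always the one at the top of the site.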

This is the basic structure of a robots.txt file, where * (asterisk) means ALL robots and / (slash) means every page on the site. Read as a whole, the file below says: ALL robots are NOT allowed to index any of the pages. We don’t want that, but it shows the basic components: a User-agent line naming the robot, followed by Disallow: and then a / and the name of the file or folder to block.


User-agent: *
Disallow: /

On the other hand, the file below allows all robots to index all pages. This is effectively the default for every website until we create a robots.txt that lists the files we don’t want indexed.


User-agent: *
Disallow:
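
To see why a lone slash blocks everything while an empty Disallow blocks nothing, here is a much-simplified sketch of the matching idea. The function is_disallowed is a made-up name, and the sketch assumes plain prefix matching against a single list of Disallow values; real crawlers do more than this.

```python
def is_disallowed(path, disallow_values):
    """Simplified check: is this path blocked by any Disallow value?

    An empty value matches nothing. Any other value blocks every
    path that starts with it, so "/" blocks the whole site.
    """
    return any(value != "" and path.startswith(value)
               for value in disallow_values)

print(is_disallowed("/welcome.html", ["/"]))  # rules from "Disallow: /"
print(is_disallowed("/welcome.html", [""]))   # rules from an empty "Disallow:"
```

Every path on a site starts with a slash, which is why the first file shuts robots out completely and the second lets them in everywhere.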

If you don’t have much on your site yet, it is best to just use the above, create an empty robots.txt file, or simply do nothing. /robots.txt is for those who have files in their directories that they don’t want indexed; files they don’t want to see turning up in searches.

There’s really no point in having folders like images or cgi-bin indexed, and keeping robots out of them saves bandwidth. The file below allows all robots to index your pages except the paths listed on the Disallow lines.


User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /privatestuff
Disallow: /temp/
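
If you would like to test rules like these before uploading them, Python's standard library includes a robots.txt parser. The sketch below feeds it the same lines and asks about a few paths. The robot name ExampleBot, the helper allowed and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from this post; www.myownsite.com is our example host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /privatestuff
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, robot="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.myownsite.com" + path)

for path in ["/welcome.html", "/images/logo.png",
             "/privatestuff.html", "/temporary.html"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

With this parser, /welcome.html and /temporary.html come back allowed while the other two are blocked. Note that /privatestuff, written without a trailing slash, also blocks /privatestuff.html, whereas /temp/ with the slash only covers what is inside that folder.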

If you want to be specific and only allow Google to search your directory, you may do so with this:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

User-agent: *
Disallow: /

It means you’re allowing Google to index your pages except the cgi-bin and privatedir folders, while the second record turns every other robot away. Without that second record, robots not named in the file would face no restrictions at all. Note however that if you do this, you’re not allowing MSN, Yahoo or Alexa to index your site. So you might want to reconsider doing so.
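
How a robot picks its record can be checked with Python's built-in parser. A robot follows the record that names it and otherwise falls back to the * record, so in this sketch it is the catch-all record with Disallow: / that keeps everyone but Googlebot out. The second robot name, OtherBot, and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# One record for Googlebot, one catch-all record for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.myownsite.com"  # example host used in this post

for robot in ["Googlebot", "OtherBot"]:
    for path in ["/welcome.html", "/privatedir/notes.html"]:
        verdict = "allowed" if parser.can_fetch(robot, SITE + path) else "blocked"
        print(robot, path, verdict)
```

Here Googlebot may fetch /welcome.html but not anything under /privatedir/, and OtherBot is blocked from both.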

The samples above should be used as is. Be careful with spelling, colons and placement: for example, writing Disalow instead of Disallow, or User Agent instead of User-agent. Also, the filename is robots.txt, not Robots.Txt.
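
Slips like these are easy to catch with a small script. The checker below is a rough sketch of our own, not an official validator: lint_robots_txt is a made-up name, and the list of known field names is an assumption (Allow and Sitemap are common extensions that this post does not cover).

```python
# Field names this sketch accepts; the list is an assumption for illustration.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}

def lint_robots_txt(text):
    """Return (line number, problem) pairs for lines that look wrong."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field = line.split(":", 1)[0].strip()
        if field.lower() not in KNOWN_FIELDS:
            problems.append((number, "unknown field: " + field))
    return problems

SAMPLE = "User Agent: *\nDisalow: /temp/\nDisallow /images/\nDisallow: /cgi-bin/"
for number, problem in lint_robots_txt(SAMPLE):
    print("line", number, "-", problem)
```

On the sample it flags the first three lines (User Agent, Disalow and the missing colon) and lets the last one pass.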

Where to place the robots.txt file?

Placement of the robots.txt file is very important: put it in the wrong place and robots and search crawlers won’t find it, so they will most likely index ALL your pages. They don’t have all day to hunt for a robots.txt file among our files. The only place to put it is the root directory of the site, not in a folder or sub-directory. To check for your robots.txt file, just type this into the address bar of your browser: http://myownsite.com/robots.txt
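
The placement rule is simple enough to express as a one-line check. In this sketch, is_at_site_root is a made-up helper and the wrong addresses are invented examples; it only looks at the address and does not download anything.

```python
from urllib.parse import urlsplit

def is_at_site_root(robots_url):
    """True only if the address points to /robots.txt at the site root."""
    return urlsplit(robots_url).path == "/robots.txt"

for url in ["http://myownsite.com/robots.txt",
            "http://myownsite.com/blog/robots.txt",
            "http://myownsite.com/Robots.Txt"]:
    print(url, "ok" if is_at_site_root(url) else "robots will not look here")
```

Only the first address passes; a copy inside a folder or with different capitalization is simply never requested.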

We have shared only the basic information on robots.txt. To learn more, visit robotstxt.org. The site also leads to the Robots Database, a list of over 300 identified robots wandering the internet every day.

To create and validate your robots.txt file, Clockwatchers can help! Motoricerca also offers a robots.txt checker.

This entry was posted on Sunday, December 23rd, 2007 at 1:19 am and is filed under Blogging, Recommended Reading, SEO, Tools.

7 Responses to “robots.txt: How Google and other Search Engines Find Your Website”

    1. EarnBlogger Says:

      Merry Christmas! Your article points towards an important point. We all must know our robots.txt file! Now, I’m trying to change it! Thanks.

    2. REP META Tag: How Google and other Search Engines Find Our Website | blogsthatfollow.com Says:

      [...] other Search Engines Find Our Website Posted in December 30th, 2007 by admin in Blogging, SEO On a recent post we discussed about Robots Exclusion Protocol (REP) /robots.txt and how it is used to instruct [...]

    3. admin Says:

      glad you found our post helpful EarnBlogger! Thank you for dropping a note. All the best for the new year to you!

    4. Lordvader Says:

      Today I learned something new, thanks.

    5. david stern Says:

      Stumbled upon your blog and read a few of your postings.. Nice blog. Keep up the good work. Looking forward to reading more from you in the future.

    6. john daily Says:

      Thanks for sharing the information. Looking forward to more stuff.

    7. Mark Kinch Says:

      Great information. Thanks for sharing… I’ll keep reading.
