The robots.txt file tells spiders and other web robots which parts of a site they are allowed to crawl.  When a robot first visits a site, it checks the root of the domain (www.mywebsite.com/robots.txt) to see what is allowed and disallowed.
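
As a rough illustration of where a robot looks, the sketch below builds the robots.txt address from any page address on the same site.  The function name robots_url and the sample page path are made up for this example; only the rule that the file sits at the root of the domain comes from the text above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used in this post.
location = robots_url("http://www.mywebsite.com/blog/post.html")
print(location)
```

Whatever page the robot starts from, it ends up looking at the same single file for the whole domain.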

The robots.txt file looks something like this:

User-agent: *
Disallow: /

The “User-agent:” line identifies which robot the rules that follow apply to.  In this case the “*” is a wild card, which means the rules apply to ALL robots.

The “Disallow:” line identifies the location that is off limits to the spiders or web robots.  In this case, the “/” means the root directory, and because rules match by path prefix, everything beneath it as well.

Taken together, the two lines above tell all robots not to crawl any part of the site.
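
One way to see the effect of these two lines is to feed them to the robots.txt parser that ships with Python (urllib.robotparser).  This is only a sketch: the robot name ExampleBot and the page about.html are invented for the example, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same two rules shown above.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name; "*" covers it like any other.
home_allowed = parser.can_fetch("ExampleBot", "http://www.mywebsite.com/")
page_allowed = parser.can_fetch("ExampleBot", "http://www.mywebsite.com/about.html")

print(home_allowed, page_allowed)
```

Both checks come back False, because every path on the site starts with “/”.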

Another example of this would be:

User-agent: *
Disallow: /~ted/homework
Disallow: /images/2009

OR

User-agent: Adsbot-Google
Disallow: /financial-records
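
The two examples above can be checked the same way with Python's built-in urllib.robotparser.  Again this is a sketch under assumptions: the helper build_parser, the robot name ExampleBot, and the file names photo.jpg and q1.html are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# First example: rules for every robot.
general = build_parser("""\
User-agent: *
Disallow: /~ted/homework
Disallow: /images/2009
""")

# Second example: rules for one named robot only.
ads_only = build_parser("""\
User-agent: Adsbot-Google
Disallow: /financial-records
""")

SITE = "http://www.mywebsite.com"

# Only the listed directories are off limits; the rest stays crawlable.
old_images = general.can_fetch("ExampleBot", SITE + "/images/2009/photo.jpg")
new_images = general.can_fetch("ExampleBot", SITE + "/images/2010/photo.jpg")

# The named robot is blocked; a robot with a different name is not.
ads_blocked = ads_only.can_fetch("Adsbot-Google", SITE + "/financial-records/q1.html")
other_bot = ads_only.can_fetch("ExampleBot", SITE + "/financial-records/q1.html")

print(old_images, new_images, ads_blocked, other_bot)
```

The point to notice is that a rule written for one user-agent says nothing about any other robot, so the financial records in the second example are still open to everyone except Adsbot-Google.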

Now here is the kicker.  Just because you have asked Google not to crawl certain content on your site, does that mean it won't show up in the search engines?  Not necessarily.  If you tell Google not to crawl an images directory, yet some of those images are linked from somewhere else on the internet, Google can still make the association through those links.  Matt Cutts from Google does a nice job of explaining this.

The robots.txt file can be used to shape the content that you want the search engines to associate with your site.  Here is an example of how the New York Times uses it.