The robots.txt file explained

Web robots, such as those used by search engines, constantly crawl your website to gather information about its content. The robots.txt file implements the Robots Exclusion Protocol: it tells a robot which parts of your site it may crawl and which it should ignore.

How does a robots.txt work?

[Image: part of Google’s robots.txt]

There are two main parts to the file: the user agent and the disallow list. The user agent names the robot that a set of rules applies to, whilst the disallow list states which directories that robot may and may not look in. To address all robots, a “*” is used as the user agent, and to disallow access to all directories, a “/” is used.

The following would exclude all robots from your entire server:

User-agent: *
Disallow: /

If you wanted only one robot, in this example Google, to be allowed to access your server, and every other robot to be excluded, you would put the allowed robot first. Note that there is nothing entered after the first Disallow, which gives the Google robot complete access to the server:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
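
If you want to check how a parser interprets these rules, Python’s standard library ships urllib.robotparser. A minimal sketch, using the example rules above and a placeholder URL:

from urllib import robotparser

# The example rules above, supplied as a list of lines.
rules = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot may fetch anything; every other robot is excluded.
print(rp.can_fetch("Googlebot", "http://example.org/page.html"))      # True
print(rp.can_fetch("SomeOtherBot", "http://example.org/page.html"))   # False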

To exclude robots from certain files or subdirectories, you will need to name each individually:

User-agent: *
Disallow: /cgi/
Disallow: /temp.html
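
The same parser confirms path-based rules; again a sketch with placeholder URLs:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi/",
    "Disallow: /temp.html",
])

# Only the named paths are blocked; everything else stays crawlable.
print(rp.can_fetch("*", "http://example.org/cgi/script"))    # False
print(rp.can_fetch("*", "http://example.org/temp.html"))     # False
print(rp.can_fetch("*", "http://example.org/index.html"))    # True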

You can find lists of known web robots online.

Where to put a robots.txt?

No matter which page of your website the robot is looking at, it will strip everything after the host name from the URL and look for your robots.txt file there. The file must therefore be in your top-level directory, usually where your index.html file is stored.

Just like this: http://example.org/robots.txt
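
Deriving that location from any page URL is straightforward with Python’s urllib.parse; a minimal sketch:

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.org/blog/post.html?id=7"))
# http://example.org/robots.txt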

Why use a robots.txt?

The file can be used to block access to directories such as your images folder, which robots can consume a lot of bandwidth crawling. Without a robots.txt file, your server logs will also record an error each time a robot requests it; whilst not a problem in itself, this clutters your logs and makes genuine errors harder to spot. Depending on the content of your website, a robots.txt file can also improve your search engine ranking.

Common mistakes when using a robots.txt

A common mistake is assuming that the protocol will be obeyed. Whilst the major search engines will obey what they are told, malicious web robots that scour the web for e-mail addresses and vulnerabilities will ignore the file, and in some cases even head straight for the disallowed sections, so nothing listed there is truly hidden.

Even if you have disallowed a page from being accessed by robots, its URL may still be indexed by a search engine if it is mentioned or linked to elsewhere on the Internet. You can prevent this by placing the robots meta tag with the noindex value in the head section of the page in question, e.g.:

<meta name="robots" content="noindex">
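
For illustration, here is how an indexer might detect that tag using Python’s html.parser; a minimal, hypothetical sketch, not how any particular search engine actually works:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives += (attrs.get("content") or "").lower().split(",")

parser = RobotsMetaParser()
parser.feed('<head><meta name="robots" content="noindex"></head>')
print("noindex" in [d.strip() for d in parser.directives])   # True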

Why is a robots.txt important for a website’s rankings in search engines?

If you have worked hard improving the SEO of certain pages, you may find that search engines can categorise, and therefore rank, your website better when they are not scanning your less relevant pages, for example your terms and conditions.

Nice to know

If you are using an XML sitemap, you can tell a web robot where to find it. Just add the following line to your robots.txt (assuming that sitemap.xml is the name of your sitemap):

Sitemap: http://example.org/sitemap.xml
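
Since Python 3.8, urllib.robotparser also exposes these declarations, so a crawler can pick the sitemap up programmatically; a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "",
    "Sitemap: http://example.org/sitemap.xml",
])

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(rp.site_maps())   # ['http://example.org/sitemap.xml']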