
Robots.txt

Definition

The robots.txt file implements the robots exclusion protocol and is used to prevent search engine crawlers from accessing certain files on your site. It acts as a guide for search engine bots, whether Googlebot, Bingbot, or Yandex’s robots, by blocking them from crawling certain URLs. While a site can rank without one, this is a file you should include if you want to control how your site is crawled.

This file is placed at the root of a website, which makes it one of the first files analysed by crawlers. To check that it is there, simply type the address of your site in the search bar and add “/robots.txt” after it. If the file does not exist, a 404 error will be displayed.


What is the purpose of a robots.txt file?

How this file works

This file makes it possible to prevent crawling, and in most cases indexing:

  • Of certain pages of your site, by all search engine crawlers
  • Of your site as a whole, by specific robots
  • Of specific pages of your site, by specific robots

In addition, the robots.txt file tells search engine crawlers about your sitemap, so they can easily find it.
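
As an illustration, here is what a small robots.txt combining these uses might look like (the folder names, the robot name “SomeBot” and the sitemap URL are purely illustrative):

  # Block every crawler from one folder of the site
  User-agent: *
  Disallow: /example-folder/

  # Block one specific robot from the whole site
  User-agent: SomeBot
  Disallow: /

  # Tell crawlers where to find the sitemap
  Sitemap: https://www.example.com/sitemap.xml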

Benefits of the robots.txt file for SEO

The robots.txt file, if used properly, can help the SEO of your website. Indeed, it allows you to:

  • Prevent robots from crawling duplicate content
  • Provide crawlers with the sitemap of your site
  • Save Google’s crawl budget by excluding low-quality or irrelevant pages of your website

When should I use a robots.txt file?

For e-commerce sites

This file is used extensively, particularly on e-commerce sites, because it helps solve the duplicate content problems caused by faceted navigation. Faceted navigation is found on many e-commerce sites and allows users to quickly find what they are looking for by filtering the products on offer. However, this kind of search creates many pages with very similar content, because of the multitude of possible combinations of filters and categories. These pages then risk cannibalising each other, as well as diluting the PageRank captured by your strategic pages.

This is where the robots.txt file comes into play: you can use it to prevent these pages from being crawled by search engine robots, while leaving them accessible to users.
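
For example, if your faceted navigation adds a “?filter=” parameter to URLs, the blocking rule could look like this (the parameter name is only an assumption; use whatever your own platform generates):

  User-agent: *
  # Block every URL produced by the faceted filters
  Disallow: /*?filter=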

For specific pages of your website

The robots.txt file is also used to prevent Google from crawling certain types of files on your website:

  • Images
  • PDFs
  • Videos
  • Excel files

This is because these files are generally used to attract leads. For example, if you want to obtain users’ details before giving them access to these documents, the robots.txt file lets you keep the documents out of search results, so that visitors can only reach them by filling in the required information.
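
For instance, to keep such documents out of search results while still serving them to visitors who have filled in your form, you could use rules along these lines (the file extensions are just examples):

  User-agent: *
  # Block crawling of downloadable lead-generation documents
  Disallow: /*.pdf$
  Disallow: /*.xlsx$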

To keep parts of your site private

As a webmaster, there are probably some parts of your site that you want to keep private, such as certain personal files, or URL parameters.
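
A minimal sketch of such a rule, assuming a hypothetical “/private/” folder:

  User-agent: *
  # Keep this folder out of crawlers' reach
  Disallow: /private/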

To avoid overloading your website

Finally, the robots.txt file can be used to indicate a crawl delay, to avoid your servers being overloaded by search engine crawlers. When crawlers request several pages of your site at the same time, they can overload servers that do not have the capacity to serve so much content simultaneously.

How to create a robots.txt file?

The robots.txt file is either created manually or generated automatically by most CMSs, such as WordPress, and it must sit at the root of the site. You can also use online tools to create this file.

If you want to create your robots.txt file manually, you can use any text editor, as long as you follow a few rules (see the example below):

  • Syntax and instructions: User-agent, Disallow, and Allow.
  • Naming: the file must be called exactly robots.txt.
  • A structure to adopt: one instruction per line, with no empty lines within a group of rules.

Be careful: your robots.txt file must not exceed 500 KiB (about 512 KB), as Google ignores any content beyond that limit.
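
Putting these rules together, a minimal hand-written file might look like this (the blocked folders and the sitemap URL are only examples):

  # One instruction per line, grouped under a User-agent
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/

  Sitemap: https://www.example.com/sitemap.xml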

Creating your robots.txt file with Rank Math

Creating your robots.txt file is very simple. To do so, you just have to go to the “General Settings” tab of your plugin, then click on “Edit robots.txt”. You can then write your file directly in the plugin, which will automatically integrate it into your site.

Please note that if you had already added a robots.txt file to your site before installing the Rank Math plugin, and you now want to manage it through the plugin, you must delete the file from your site before you can edit it in Rank Math.

Creating your robots.txt file with Yoast

You can also use the Yoast plugin to manage your robots.txt file. As with Rank Math, the way it works is quite simple. Go to the plugin’s “SEO” menu, click on “Tools”, then on “File editor”. If you have not yet added the file, you just have to create it by clicking on the associated button. All that remains is to make the desired changes and click on “Save changes”.

What language to use for a robots.txt file

The robots.txt file uses a specific syntax that includes a few wildcard characters, similar to regular expressions (regex), which simplify writing the rules. Here are the most common instructions.

User-agent:

This command lets you address specific search engine crawlers. The robots.txt file is the first file that search engine robots scan; they check whether they are mentioned in it, and if they see their name appear, they read the commands that have been assigned to them.

To mention a search engine, you simply insert its name after the User-agent command. For example, if you want to mention Google, you would write “User-agent: Googlebot”. Furthermore, if you wish to address all search engine crawlers at once, you only need to write the following command: “User-agent: *”.
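
For example, a file that addresses Googlebot and all other crawlers separately might start like this (the folder names are illustrative):

  # Rules for Google only
  User-agent: Googlebot
  Disallow: /google-blocked/

  # Rules for every other crawler
  User-agent: *
  Disallow: /bots-blocked/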

Disallow: /

This command prevents crawlers from crawling certain parts of your website. However, you can only add one command per line, which is why there are multiple lines of “disallow” commands in a row in robots.txt files.

For example, if you put “Disallow: */catalog/” after your “User-agent: Googlebot” line, you are disallowing Google’s robots from visiting all of your catalogue pages.
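
In context, that rule reads as follows:

  User-agent: Googlebot
  # Keep Google out of every URL containing /catalog/
  Disallow: */catalog/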

Allow: /

This command, which is supported by Googlebot and by a number of other major crawlers, grants access to a page or sub-folder even if its parent folder is disallowed.

For example, if you add the command “Allow: /wp-admin/admin-ajax.php” after the command “Disallow: /wp-admin/”, you allow Googlebot to access the file “admin-ajax.php” while still preventing it from crawling the rest of your “wp-admin” folder.
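
Written out, this classic WordPress pairing looks like this:

  User-agent: Googlebot
  # Block the admin area...
  Disallow: /wp-admin/
  # ...but keep the AJAX endpoint reachable
  Allow: /wp-admin/admin-ajax.php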

Crawl-Delay:

The Crawl-delay command allows you to ask crawlers to wait a few seconds between two requests to your site. For example, by entering “Crawl-delay: 20”, you ask the robots of the search engines concerned to wait 20 seconds between each request. Note that not all crawlers honour this directive; Googlebot, in particular, ignores it.
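
For example, to ask Bing’s crawler to slow down (Googlebot would simply ignore this line):

  User-agent: Bingbot
  # Wait 20 seconds between two requests
  Crawl-delay: 20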

Sitemap:

As you can imagine, this command allows you to directly indicate your sitemap to the crawlers. To do this, you simply insert the URL of your sitemap after the “Sitemap:” command.
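
For example, assuming your sitemap sits at the root of your domain:

  Sitemap: https://www.example.com/sitemap.xml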

Syntax of robots.txt

There are some syntax elements specific to robots.txt files that are important to know (a short combined example follows this list):

  • / : The slash separates the levels of a path. If you leave a “/” on its own, without adding the name of a folder or file, the command applies to your entire site. For example, the command “Disallow: /” blocks the relevant search engine spiders from your whole site.
  • * : The asterisk matches any string of characters in a URL. For example, the command “Disallow: *?filter=*” prevents search engines from accessing all URLs containing “?filter=”.
  • # : The hash sign allows you to add comments to your robots.txt file. This lets you give additional information to anyone reading the file, without search engines mistaking it for instructions.
  • $ : The dollar sign marks the end of a URL, so the rule only applies to URLs that end with the element preceding it. For example, the instruction “Disallow: /solutions/$” blocks access to a URL ending in “/solutions/”, but not to URLs where “/solutions/” is followed by further slugs.
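
A short example combining these syntax elements, reusing the rules mentioned above:

  # Comments start with a hash and are ignored by crawlers
  User-agent: *
  # The asterisk matches any string of characters
  Disallow: /*?filter=*
  # The dollar sign anchors the rule to the end of the URL
  Disallow: /solutions/$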

Some tips for optimising this file

To optimise your robots.txt file, it is important to adopt some good practices:

  • Make sure you do not block the URLs of your website that you want to be indexed
  • Keep in mind that links placed on blocked pages will not be followed
  • Do not use the robots.txt file to block the display of sensitive data in the SERP. Indeed, this file does not systematically prevent the indexing of blocked URLs, as these pages can very well be indexed if other sites or pages point to them
  • Some search engines have several crawlers. Specifying directives for each of these crawlers is not mandatory, but it helps to refine your content analysis

To ensure that all your important URLs remain crawlable by Google, you can test your robots.txt file. To do this, add your site to Google Search Console and open its robots.txt report (the successor to the old robots.txt Testing Tool), which shows how Google reads the file and flags any errors.

Limitations of the robots.txt file

However, the robots.txt file has certain limitations, which are as follows:

  • The directives in the robots.txt file are not followed by all search engines: the file gives instructions, but it is up to each crawler to obey them or not. This is why it is advisable to use other blocking methods to protect certain data on your website. For example, you can protect the private files on your site with a password.
  • Not all crawlers interpret syntax in the same way: It is therefore a challenge to find the right syntax for all crawlers to understand your guidelines.
  • A page that is disallowed in the robots.txt file can still be indexed if other sites point to it.
