Robots.txt and SEO – The Ultimate Guide from the Experts

Written by Tiago Silva. Updated on 6 September 2024

Your “robots.txt” file, one of the most important files on your website, tells search engine crawlers whether they should crawl a web page or leave it alone.

One of the main aims of a robots.txt file is to ensure your servers are not overloaded by search engine bots crawling your website at all hours of the day. But did you know that a properly optimised robots.txt file also brings SEO-related benefits?

Keep reading this guide to understand more about robots.txt, the best practices, and how this helps your website’s SEO efforts.

 

What Does the Robots.txt File Do?

A common misconception is that a robots.txt file prevents a page from getting indexed by search engines.

That’s not true.

Google says, “If other pages point to your page with descriptive text, Google could still index the URL without visiting the page”. If Google indexes a page disallowed by robots.txt, it won’t show a description because Google wasn’t allowed to visit the page.

When you want to prevent a page from being indexed, use the noindex tag. Alternatively, lock the page with a password to prevent Google from indexing private information.

The default assumption is that crawlers can crawl, index and rank all of the pages on your website unless they are blocked with a disallow directive (more information in the robots.txt syntax and formatting section below). So, if a robots.txt file doesn’t exist or isn’t accessible, search engines will crawl, index and rank every single page they can find on your website as if there were no restrictions in place.

Robots.txt Syntax and Formatting

The robots exclusion standard (a.k.a. robots.txt) should follow a set of rules to be valid and usable by search engine crawlers.

Robots.txt file on Money Saving Heroes website.

The structure of a robots.txt file includes:

  • Group: Names the user-agent and the directives that it must follow. A robots.txt file can have as many groups as you want, but most user-agents will only follow the first group that applies to them.
  • User-Agent: The identification of a crawler. Naming the user-agent in the robots.txt tells that crawler which directives are set for it. For example, you can name Googlebot as one user-agent and the Pinterest bot as another.
  • Directives: These are the instructions that each user-agent named in the same group has to follow.
  • XML Sitemap: It’s possible and common for robots.txt files to mention the sitemap because it makes it easy for search engine crawlers to find the file.

Group of directives on robots.txt from Money Saving Heroes.
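To make the structure concrete, here is an illustrative file with two groups and a sitemap reference (Googlebot is Google’s crawler, Pinterestbot stands in for the Pinterest bot mentioned above, and the domain and paths are made up):

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Pinterestbot
Disallow: /not-for-pinterest/

Sitemap: https://www.mydomain.com/sitemap.xml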

Crawlers usually process groups from top to bottom, but Googlebot and Bingbot will default to the most specific rules, as those are usually less restrictive.

User agents will only follow one group, so you should avoid having contradictory directives for the same user agent. If more than one group of directives targets a user-agent, most crawlers will ignore the extras and follow only the first group they find in the robots.txt.

You can also use the robots exclusion file on subdomains (for example, www.mydomain.com/robots.txt or blog.mydomain.com/robots.txt) or non-standard ports (mydomain.com:8181/robots.txt).

The main rules of the robots.txt file:

  • Must be UTF-8 encoded.
  • Must be named “robots.txt”.
  • Must be located on the root of the domain.
  • It will only be valid for the same protocol (HTTP or HTTPS) and subdomain (for example, www or non-www) where it’s located.
  • Should use only relative paths (except for the sitemap).
  • Must have only one directive per line.
  • Paths in directives are case-sensitive.
  • Comments start with # and don’t get read by crawlers.
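For illustration, a minimal file that follows these rules might look like this (the domain and path are hypothetical):

# Keep crawlers out of the internal search results
User-agent: *
Disallow: /search/
Sitemap: https://www.mydomain.com/sitemap.xml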

Robots.txt Directives

Different search engine crawlers, whether Googlebot or Bingbot, for example, will follow the directives relevant to them within a robots.txt file. This tells them which pages on a website they are allowed to crawl, index, and rank.

It is worth noting that not all crawlers support the same directives or even interpret the syntax of a directive in the same way.

Googlebot is one of those user agents that doesn’t support all directives. But before explaining those in detail, let’s first see the list of all directives:

  • Sitemap.
  • Disallow.
  • Allow.
  • Crawl-delay.
  • Noindex.
  • Nofollow.

All directives, except the sitemap, support wildcard matching (a limited form of RegEx) for the entire string, a prefix, or a suffix. Directives should start with a slash (“/”) when referring to a page and finish with “/” when referring to a directory.
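As a quick illustration (the paths here are hypothetical), a rule for a single page, a rule for a directory, and a wildcard rule look like this:

User-agent: *
Disallow: /private-page
Disallow: /private-directory/
Disallow: /*?sessionid=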

Sitemap Directive

The sitemap directive shows the URL where the XML sitemap of a website lives, making it easier for crawlers to find it. This directive is supported both inside and outside of groups. Unless you have a specific sitemap for a particular bot, it’s better to declare it at the beginning of the robots.txt so all crawlers can use it.

As mentioned in our guide to XML Sitemaps, they aren’t mandatory, and if you’ve already submitted yours in Google Search Console, this directive can be redundant. However, declaring the XML sitemap doesn’t hurt you and makes it easier for other user agents like Bingbot to find it.

Example showing usage of the Sitemap directive:

Sitemap: https://mydomain.com/sitemap.xml
User-agent: *
Disallow: /admin

Disallow Directive

The disallow directive tells crawlers they aren’t allowed to visit the URL or matching expression (when using RegEx). This is the directive you will use most frequently in your robots.txt file because, by default, there are no limitations on the pages bots can visit.

In the example below, the disallow directive prevents bots from crawling the admin pages on a WordPress site:

User-agent: *
Disallow: /wp-admin/

The reason most (if not all) websites have this particular directive in place is that site owners, understandably, do not want links to their admin login areas indexed and ranked. You can see the problem this may cause, right?

Allow Directive

The allow directive tells crawlers they can visit and crawl a URL or matching RegEx. This rule is mainly used to override a disallow directive when you want bots to crawl a page inside a blocked directory.

An example could be allowing crawlers to visit the login page inside an otherwise blocked admin directory:

User-agent: *
Disallow: /admin/
Allow: /admin/login

Crawl-Delay Directive

The crawl-delay directive limits how frequently crawlers visit URLs to avoid overloading servers. Not all crawlers support this directive, and they can interpret the number of the crawl-delay differently.

Example:

User-agent: *
Crawl-delay: 1

In the example above, Bingbot would interpret that it should wait 1 second before crawling another URL. If this directive is not set up correctly, a particular search engine crawler, such as Bingbot, may put too much strain on your website’s server and even crash your entire website.

Noindex Directive

The noindex directive in robots.txt was meant to prevent URLs from getting indexed. However, Google ended support for it in 2019, as it was never officially documented. Gary Illyes mentioned that one of the reasons for the removal was that websites “were hurting themselves” with the noindex.

Don’t believe me? I have seen plenty of examples where websites have noindexed entire sections of their websites, causing them (in almost all cases) to not be ranked on a particular search engine.

One of the main reasons for this is staging websites. Naturally, whilst a website is being built in a staging environment, almost all of the pages are noindexed to ensure Google and other search engines leave them alone during the build. Once a website goes live, it’s entirely possible for someone to forget to edit the robots.txt file and leave sections of the website set to noindex.

Yes. This really is an SEO professional’s worst nightmare!

In practical terms, this directive worked similarly to the noindex tag. The difference is that this directive centralized instructions in one place (robots.txt) instead of having to create a noindex tag on each page.

In this example, the robots.txt requests crawlers not to index the about page:

User-agent: *
Noindex: /about

Nofollow Directive

The nofollow directive tells crawlers not to follow links in a URL. This is similar to what the nofollow tag does, but instead of applying to a single link, it applies to every link on the page. Google doesn’t support this directive, as they announced in 2019 (in the same announcement as noindex above).

Google employees had also been telling people not to use this directive for a long time before that announcement.

In this example, the robots.txt requests crawlers not to follow links on the about page:

User-agent: *
Nofollow: /about

Supported Robots.txt Directives by Google Crawlers

Googlebot only supports the following robots.txt directives:

  • User-agent.
  • Disallow.
  • Allow.
  • Sitemap (when mentioned outside a group).

Per Google’s documentation, AdsBot crawlers should be explicitly named as a user agent, so using the wildcard (*) will not include Google’s AdsBot.
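As a sketch of what that means in practice, AdsBot-Google has to be named on its own if you want it to obey the same restriction as other crawlers (the path is hypothetical, and blocking AdsBot is rarely something you would actually want to do):

User-agent: *
Disallow: /drafts/

User-agent: AdsBot-Google
Disallow: /drafts/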

How Google Interprets Robots.txt Directives

Google has extensive documentation about how its crawlers interpret directives from robots.txt files.

Summary of Googlebot interpretation of directives:

  • The robots.txt file should be UTF-8 encoded and smaller than 500 KiB.
  • The robots.txt file will only be considered valid for the same domain (same protocol, subdomain/host, and port number).
  • They don’t check the robots.txt of subdomains when crawling the root domain. For example, when they are looking at www.amazingcompany.com they won’t look at blog.amazingcompany.com. This is crucial, as it means you need a separate robots.txt file for any pages or parts of your site that use a subdomain.
  • When the robots.txt returns a 4XX code other than 429, Google assumes there are no crawling limitations.
  • For 5XX and 429 codes, Google considers the site disallowed but will keep trying “to crawl the robots.txt file until it obtains a non-server-error HTTP status code”. If, after 30 days, the file is still inaccessible, they will consider that there are no restrictions.
  • When Google thinks a site is configured to send 503 errors for the robots.txt location instead of 404s when the file doesn’t exist, they will interpret the 503 as a 404.
  • Google will ignore invalid lines in the file, “including the Unicode Byte Order Mark (BOM)”.
  • Spaces at the beginning and end of each line are ignored. However, they are recommended to “improve readability”.
  • Googlebot will automatically ignore comments.
  • Googlebot will ignore directives without a path.
  • URL paths should be relative to the folder of the robots.txt file (i.e., the root).
  • The XML Sitemap directive should use an absolute URL.
  • URL paths are case sensitive.
  • It’s allowed to mention more than one XML sitemap directive.
  • Googlebot tries to follow the most specific group of directives. If more than one group matches the same user agent, Google combines those groups into a single group. “User agent specific groups and global groups (*) are not combined.”
  • Inside groups, Google only pays attention to the allow, disallow, and user-agent rules.
  • In case of conflicting rules, Google uses the least restrictive directive.

Visit the documentation for detailed examples with the order of precedence for these rules.
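As a short illustration of that precedence (the paths are made up), consider the group below. For the URL /recipes/grandma-cookies, both rules match, but the allow rule has the longer, more specific path, so Google would still crawl that page:

User-agent: Googlebot
Disallow: /recipes
Allow: /recipes/grandma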

Robots.txt Example Rules

Whilst it is entirely possible to create many rules inside one robots.txt file, some rules are a lot more useful than others! This section will focus on some of the most valuable rules you need to learn.

Disallow Crawling a Directory

User-agent: *
Disallow: /grandma-recipes/

In this example, the robots.txt prevents bots from crawling all the pages in the grandma-recipes folder.

Block Access to a Single Crawler

User-agent: annoying-bot
Disallow: /

In this example, the user agent called annoying-bot isn’t allowed to crawl any page on the site. Please remember that crawlers aren’t obliged to follow these directives, but the respectable ones will.

Disallow Crawling of a Single Page

User-agent: *
Disallow: /best-grandma-cookies

In this example, no crawler is allowed to visit the page titled “Best grandma cookies”. It is also worth noting that this works in the same way as the directory rule explained a little earlier, so any URL that begins with this path will be blocked too.
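If you only want to block that exact URL and not every URL that shares the prefix, Google supports ending the rule with a dollar sign, which marks the end of the URL (the same technique used in the file type example later in this guide):

User-agent: *
Disallow: /best-grandma-cookies$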

Block All Images on Your Site from Google Images

User-agent: Googlebot-Image
Disallow: /

This group forbids Google’s image bot from visiting any page on your website, so it is incredibly useful if you do not want any of the images on your website appearing on Google Images. Keep in mind, however, that this only applies to Google’s image crawler, so it will not stop image bots from other search engines from visiting your images and possibly ranking them.

Block a Specific Image from Google Images

User-agent: Googlebot-Image
Disallow: /images/cookies.jpg

It’s possible to block only specific images from getting crawled by bots. To do that, use a disallow directive with the path to the file on the server. This is great if particular images display information you do not want listed on Google Images, but you are fine with other images on your website being indexed.

Disallow Crawling a Specific File Type

User-agent: *
Disallow: /*.pdf$

Blocking specific file types is also possible. For example, to block one file type on the site, use a regular expression with a wildcard (*), mention the file extension (PDF, in this example) and finish the expression with a dollar sign ($). The dollar sign indicates the end of the URL.

Best Practices

My advice is that you should follow the standard structure and formats of robots.txt mentioned in this guide, but below are some tips for using robots.txt more efficiently.

Using Regex to Simplify Directives

The robots.txt file supports the use of RegEx. This will make declaring the instructions in the file simpler because you can group instructions into one expression instead of writing one directive for each URL.

A practical example is search parameters on a site. Without RegEx, you would write the following directives:

User-agent: *
Disallow: /cookies/chocolate?
Disallow: /cookies/cream?
Disallow: /cakes/strawberry?
Disallow: /cakes/vanilla?

With RegEx, the robots.txt file would look more like this:

User-agent: *
Disallow: /cookies/*?
Disallow: /cakes/*?

The wildcard (*) covers all the pages in the cookies and cakes directories in this example. For websites with several directories, maintaining one directive per URL would become a burden, so using RegEx is more efficient.

Use Each User Agent Only Once

Most crawlers read the robots.txt from top to bottom and follow only the first group that applies to their user agent, and each user agent will only follow one group of directives. If you mention a crawler more than once, it will ignore the extra groups. That isn’t the case with Googlebot and Bingbot, which follow the most specific rules; in Google’s case, matching groups are bundled together as if they were one.

However, to avoid confusion, it’s better to list the specific user agents at the top and put the group with the wildcard for all non-mentioned crawlers at the bottom.
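A layout along those lines might look like this (Googlebot and Bingbot are real user agents, but the paths are only placeholders):

User-agent: Googlebot
Disallow: /google-blocked/

User-agent: Bingbot
Disallow: /bing-blocked/

User-agent: *
Disallow: /blocked-for-everyone/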

Be Specific with Directives

Being specific in the robots.txt pays off and prevents unintentional consequences of bots not crawling essential sections on your site.

Imagine you don’t want bots to crawl the cookies folder, and you create the following disallow rule:

User-agent: *
Disallow: /cookies

This disallow rule (“/cookies”) will also prevent bots from crawling the following URLs:

/cookies-recipes.html
/cookies-best-recipes-for-download.pdf

The solution would be to fix the directive and mention /cookies/, like this:

User-agent: *
Disallow: /cookies/

FAQs

Do you need to use robots.txt?

Robots.txt isn’t a ranking factor or a requirement for having good organic results, and most websites won’t notice any difference if they don’t use one.

However, using a robots.txt file becomes more important as a website grows and gains more pages. For big websites, robots.txt matters more and can have a significant influence!

As mentioned above, the robots.txt file helps manage traffic from crawlers by protecting your crawl budget and preventing server overload.

What are the limitations of a robots.txt file?

The main limitations of robots.txt are the following:

  • Not all search engine crawlers support the same directives.
  • Different crawlers interpret syntax in the file differently.
  • Disallowed pages can still be indexed.
  • Badly behaved bots can ignore the instructions of robots.txt.

What happens if you don’t use robots.txt?

If you don’t use a robots.txt file, crawlers will assume there are no limitations on the pages they can visit and index on your site; that’s the default behavior in the absence of a robots.txt file.

As Google mentions in their documentation, if the robots.txt file doesn’t exist (for example, it returns a 404), they act as if there are no restrictions and can crawl every page on the website.

Put simply, if there are pages on your website that you do not want search engine bots to visit and crawl, you need to make sure they are all listed in a robots.txt file.

What happens if you disallow a page with a noindex tag?

Google won’t see a page with the noindex tag if it’s disallowed by robots.txt. So by combining disallow and noindex, you tell Google that you don’t want them to visit the page, but as they can’t visit the page, they won’t see your instruction not to index it.

If you don’t want the page indexed, it’s better to use the noindex tag because the instruction is more specific to your goal.

How do you find robots.txt errors in Google Search Console?

Visit the Coverage report in Google Search Console to find errors and blocked pages that are currently indexed. Pages with errors will appear under the Excluded status as “Blocked by robots.txt”. Check here to see if any pages are unintentionally blocked.

You may also find “Indexed, though blocked by robots.txt” under the Valid with warning status. These are potentially pages that you don’t want Google to index. A solution is to allow crawling of those pages and use a noindex tag instead. To speed up the fix, manually ask Google to recrawl each page after adding the noindex tag.