Written by Tiago Silva. Updated on 6 September 2024
Your “robots.txt” file, one of the most important files on your website, tells search engine crawlers whether they should crawl a web page or leave it alone.
One of the main aims of a robots.txt file is to ensure your website’s servers are not overloaded by search engine bots crawling your site at all hours of the day. But did you know that a properly optimised robots.txt file also brings SEO-related benefits?
Keep reading this guide to understand more about robots.txt, the best practices, and how this helps your website’s SEO efforts.
A common misconception is that a robots.txt file prevents a page from getting indexed by search engines.
That’s not true.
Google says, “If other pages point to your page with descriptive text, Google could still index the URL without visiting the page”. If Google indexes a page disallowed by robots.txt, it won’t have a description because they weren’t allowed to visit the page.
When you want to prevent a page from being indexed, use the noindex tag. Alternatively, lock the page with a password to prevent Google from indexing private information.
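For reference, the noindex instruction itself lives on the page rather than in robots.txt. A minimal example of the robots meta tag placed in a page’s <head> (the same instruction can also be sent as an X-Robots-Tag HTTP header):
<meta name="robots" content="noindex">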
The default assumption is that crawlers can crawl, index and rank all of the pages on your website unless they are blocked with a disallow directive (more information in the robots.txt syntax and formatting section below). So, if a robots.txt file doesn’t exist or isn’t accessible, search engines will crawl, index and rank every page they can find on your website as if there were no restrictions in place.
A robots exclusion standard file (a.k.a. robots.txt) should follow a set of rules to be valid and usable by search engine crawlers.
The structure of a robots.txt file consists of one or more groups, each starting with a User-agent line followed by the directives (such as Disallow and Allow) that apply to that user agent, plus optional standalone lines such as Sitemap.
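Here is a minimal sketch of that structure, with placeholder paths and a placeholder sitemap URL:

Sitemap: https://mydomain.com/sitemap.xml

User-agent: Googlebot
Disallow: /example-private/

User-agent: *
Disallow: /example-temp/
Allow: /example-temp/public-page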
Crawlers usually process groups from top to bottom, but Googlebot and Bingbot default to the most specific matching rules, which are usually the less restrictive ones.
A user agent will only follow one group, so you should avoid having contradictory directives for the same user agent. If more than one group targets the same user agent, most crawlers will ignore the duplicates and follow only the first matching group they find in the robots.txt.
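For illustration, imagine a hypothetical crawler called examplebot and the following two groups; most crawlers in this situation would follow only the first group and never act on the second:

User-agent: examplebot
Disallow: /first-directory/

User-agent: examplebot
Disallow: /second-directory/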
You can also use the robots exclusion file on subdomains (for example, www.mydomain.com/robots.txt or blog.mydomain.com/robots.txt) or non-standard ports (mydomain.com:8181/robots.txt).
The main rules of the robots.txt file: it must be named robots.txt, it must sit at the root of the host it applies to, there can only be one robots.txt file per host, and it must be a plain text file encoded in UTF-8.
Different search engine crawlers, whether Googlebot or Bingbot, for example, will follow each directive relevant to them within a robots.txt file. This is how they work out which pages on a website they are allowed to crawl, and in turn which pages can be indexed and ranked.
It is worth noting that not all crawlers support the same directives or even interpret the syntax of a directive in the same way.
Googlebot is one of those user agents that doesn’t support all directives. But before explaining those in detail, let’s first go through the full list of directives: Sitemap, Disallow, Allow, Crawl-delay, Noindex and Nofollow.
All directives, except Sitemap, support RegEx-style wildcards for the entire string, a prefix or a suffix (in practice, robots.txt only supports the * wildcard and the $ end-of-URL anchor, not full regular expressions). Directives should start with a slash (“/”) when referring to a page and finish with “/” when referring to a directory.
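To illustrate, here is a sketch with placeholder paths showing a page rule, a directory rule and a wildcard rule:

User-agent: *
Disallow: /private-page
Disallow: /private-directory/
Disallow: /*.json$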
The sitemap directive points crawlers to the URL of a website’s XML sitemap, making it easier for them to find it. This directive is supported both inside and outside of groups. Unless you have a specific sitemap for a particular bot, it’s better to declare it at the beginning of the robots.txt, so all crawlers can use it.
As mentioned in our guide to XML Sitemaps, they aren’t mandatory, and if you’ve already submitted yours in Google Search Console, this directive can be redundant. However, declaring the XML sitemap doesn’t hurt and makes it easier for other user agents, like Bingbot, to find it.
Example showing usage for Sitemap directive:
Sitemap: https://mydomain.com/sitemap.xml
User-agent: *
Disallow: /admin
The disallow directive tells crawlers they aren’t allowed to visit the URL or matching expression (when using RegEx). This is the directive you will use most frequently in your robots.txt file because, by default, there are no limitations on the pages bots can visit.
In the example below, the disallow directive blocks bots from crawling the admin pages of a WordPress site, and it would look like this:
User-agent: *
Disallow: /wp-admin/
The reason most (if not all) websites have this particular directive in place is that site owners, understandably, do not want links to their admin login areas indexed and ranked. You can see the problem that would cause, right?
The allow directive tells crawlers they can visit and crawl a URL or matching RegEx. This rule is mainly used to overwrite a disallow directive when you want bots to crawl a page from a blocked directory.
An example could be allowing crawlers to visit the login page but not all the admin pages of a WordPress site.
User-agent: *
Disallow: /admin/
Allow: /admin/login
The crawl-delay directive limits how frequently crawlers visit URLs to avoid overloading servers. Not all crawlers support this directive, and they can interpret the crawl-delay value differently.
Example:
User-agent: *
Crawl-delay: 1
In the example above, Bingbot would interpret that it should wait 1 second between crawl requests. One of the biggest issues when this directive is not set up correctly is that a particular search engine crawler, like Bingbot for example, may put too much strain on your website’s server and even crash your entire website.
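If only one crawler is straining your server, you can scope the directive to that user agent rather than applying it to every bot. A quick sketch (the 5-second value is arbitrary, and keep in mind that Googlebot ignores Crawl-delay altogether):

User-agent: Bingbot
Crawl-delay: 5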
The noindex directive in robots.txt was meant to prevent URLs from getting indexed. However, Google ended support for it in 2019, as they had never documented it. Gary Illyes mentioned that one of the reasons for the removal was that websites “were hurting themselves” with the noindex directive.
Don’t believe me? I have seen plenty of examples where websites have noindexed entire sections of their websites, causing them (in almost all cases) to not be ranked on a particular search engine.
One of the main reasons for this is staging websites. Naturally, whilst a website is being built in a staging environment, almost all of the pages are noindexed to ensure Google and other search engines leave them alone during the build. Once a website goes live, it’s entirely possible for someone to forget to edit the robots.txt file and leave sections of the website set to noindex.
Yes. This really is an SEO professional’s worst nightmare!
In practical terms, this directive worked similarly to the noindex tag. The difference is that this directive centralized instructions in one place (robots.txt) instead of having to create a noindex tag on each page.
In this example, the robots.txt requests crawlers not to index the about page:
User-agent: *
Noindex: /about
The nofollow directive tells crawlers not to follow links on a URL. This is similar to what the nofollow tag does, but instead of applying to a single link, it applies to every link on the page. Google doesn’t support this directive, as announced in 2019 (the same announcement as for noindex above).
Also, for a long time before that announcement, Google employees had been telling people not to use this directive.
In this example, the robots.txt requests crawlers not to follow links on the about page:
User-agent: *
Nofollow: /about
Googlebot only supports the following robots.txt directives: user-agent, allow, disallow and sitemap.
Per Google’s documentation, any AdsBot crawler must be explicitly named as a user agent. So, using the wildcard (*) will not include Google’s AdsBot.
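As an illustration of that exception, a sketch that blocks all regular crawlers but still has to name AdsBot-Google explicitly for the same rule to apply to it:

User-agent: *
Disallow: /

User-agent: AdsBot-Google
Disallow: /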
Google has extensive documentation about how its crawlers interpret directives from robots.txt files.
For a summary of how Googlebot interprets each directive, along with detailed examples and the order of precedence for these rules, visit that documentation.
Whilst it is entirely possible to create many rules inside one robots.txt file, some rules are a lot more useful than others! This section will focus on some of the most valuable rules you need to learn.
User-agent: *
Disallow: /grandma-recipes/
In this example, the robots.txt prevents bots from crawling all the pages in the grandma-recipes folder.
User-agent: annoying-bot
Disallow: /
In this example, the user agent called annoying-bot isn’t allowed to crawl any page on the site. Please remember that crawlers aren’t obliged to follow these directives, but the respectable ones will.
User-agent: *
Disallow: /best-grandma-cookies
In this example, no crawler is allowed to visit the page titled “Best grandma cookies”. It is also worth noting that disallow rules are prefix matches, so, just as with the directory rule explained a little earlier, any URL that begins with this path will be blocked too.
User-agent: Googlebot-Image
Disallow: /
This group forbids Google’s image bot from visiting any page on your website, so it is incredibly useful if you do not want any of the images on your website appearing on Google Images. Keep in mind, however, that this is purely Google’s image crawler, so will not stop image bots from other search engines from visiting your images and possibly ranking them.
User-agent: Googlebot-Image
Disallow: /images/cookies.jpg
It’s possible to block only specific images from getting crawled by bots. To do that, use a disallow directive with the path to the file on the server. This is great if particular images display information you do not want listed on Google Images, but you are fine with other images on your website being indexed.
User-agent: *
Disallow: /*.pdf$
Blocking specific file types is also possible. For example, to block a single file type across the site, use a wildcard (*) expression, mention the file extension (PDF, in this example) and finish the expression with a dollar sign ($). The dollar sign indicates the end of the URL.
My advice is that you should follow the standard structure and formats of robots.txt mentioned in this guide, but below are some tips for using robots.txt more efficiently.
The robots.txt file supports RegEx-style pattern matching. This makes declaring the instructions in the file simpler because you can group instructions into one expression instead of writing one directive for each URL.
A practical example is search parameters on a site. Without RegEx, you would write the following directives:
User-agent: *
Disallow: /cookies/chocolate?
Disallow: /cookies/cream?
Disallow: /cakes/strawberry?
Disallow: /cakes/vanilla?
With RegEx, the robots.txt file would look more like this:
User-agent: *
Disallow: /cookies/*?
Disallow: /cakes/*?
The wildcard (*) covers all the pages in the cookies and cakes directories in this example. For websites with many directories, writing one directive per URL would make the file a burden to maintain, so using RegEx-style wildcards is more efficient.
Most crawlers read the robots.txt from top to bottom and follow only the first group that applies to their user agent, and each user agent will only follow one group of directives. If you mention a crawler more than once, it will ignore the extra groups. That isn’t the case with Googlebot and Bingbot: they follow the most specific rules, and in Google’s case, multiple groups for the same user agent are combined as if they were one.
However, to avoid confusion, it’s better to list the specific user agents at the top and put the group with the wildcard for all non-mentioned crawlers at the bottom.
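A sketch of that ordering with placeholder paths; the named crawlers get their own groups at the top, and every other bot falls through to the wildcard group at the bottom:

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Bingbot
Disallow: /not-for-bing/

User-agent: *
Disallow: /private/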
Being specific in the robots.txt pays off and prevents the unintended consequence of bots not crawling essential sections of your site.
Imagine you don’t want bots to crawl the cookies folder, and you create the following disallow rule:
User-agent: *
Disallow: /cookies
This disallow rule (“/cookies”) will also prevent bots from crawling the following URLs:
/cookies-recipes.html
/cookies-best-recipes-for-download.pdf
The solution would be to fix the directive and mention /cookies/, like this:
User-agent: *
Disallow: /cookies/
Robots.txt isn’t a ranking factor or a requirement for having good organic results, and most websites won’t notice any difference if they don’t use one.
However, using a robots.txt file becomes more important as a website grows and gains more pages. So for big websites, robots.txt matters more and can have a significant influence!
As mentioned above, the robots.txt will help manage traffic from crawlers, protecting your crawl budget and preventing server overload.
The main limitations of robots.txt are the following: its directives are instructions rather than enforceable rules, so not every crawler will obey them; different crawlers interpret the syntax in different ways; and a page disallowed in robots.txt can still end up indexed if other pages link to it.
If you don’t use a robots.txt file, crawlers will assume there are no limitations on the pages they can visit and index on your site; that’s the default behaviour in the absence of a robots.txt file.
As Google mentions in their documentation, if a robots.txt is inaccessible, they would act as if the file doesn’t exist and crawl every page on the website.
Put simply, if there are pages on your website that you do not want search engine bots to visit and crawl, you need to make sure they are all listed in your robots.txt file.
Google won’t see a page with the noindex tag if it’s disallowed by robots.txt. So by combining disallow and noindex, you tell Google that you don’t want them to visit the page, but as they can’t visit the page, they won’t see your instruction not to index it.
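As a sketch of that trap (the path is just a placeholder), the comment marks the page that carries a noindex tag Google will never get to see:

User-agent: *
# /secret-page contains a noindex tag, but Google can't see it because crawling is blocked
Disallow: /secret-page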
If you don’t want the page indexed, it’s better to use the noindex tag because the instruction is more specific to your goal.
Visit the Coverage report in Google Search Console to find any errors or blocked pages that are currently indexed. The pages with errors will be under the Excluded status as “Blocked by robots.txt”. Look here to see if any pages are unintentionally blocked.
You may also find “Indexed, though blocked by robots.txt” under the Valid with warning status. These are potentially the pages that you don’t want Google to index. The solution is to allow crawling of these pages and use a noindex tag instead. To speed up the fix, manually ask Google to recrawl the page after adding the noindex tag.