The ultimate guide to robots.txt

Tyler Scionti | Published on September 29, 2023

When we say robots.txt, you may be thinking of something like the Disney-Pixar movie WALL-E or another famous robot from pop culture.

Robots.txt files actually play an extremely important role for website owners and marketers. We've written about site crawling before, and a robots.txt file is one of the most important tools for optimizing your site and improving your rank.

This post gets a bit more technical when it comes to search engine optimization. If you're newer to SEO, check out our full intro to SEO here!

This blog post will explore the basics of a robots.txt file, why it matters, and how to set it up across different content management platforms.

Keep reading, or skip ahead to the section for your platform of choice for instructions on how to set up your robots.txt file.

What is a Robots.txt File?

While a robots.txt file may sound intimidating, it's a pretty simple concept.

A robots.txt file is a plain-text file placed on a website that tells search engine crawlers which pages they can or cannot crawl and index.

In other words, it tells Google which pages are off-limits and shouldn't be analyzed or shown in search results. As you search the web for information about these files, you may also see the term Robots Exclusion Protocol used to refer to the same concept.

You might be wondering: why would I want any pages hidden from Google? After all, the more content Google has from a website the better, right?

Close - not all content is valuable, and you'd be surprised how many pages get auto-created on a website that are either not worth showing in search results or contain private information that shouldn't be indexed at all.

For example, our website is built with WordPress, and as a result many pages are auto-created for the different tags and categories we use on our blog. These category pages add little value, as they're simply listing pages for our posts.

Another example could be an e-commerce shop with checkout or billing pages that should not be crawled and indexed. These pages contain private information, or may sit behind a login, so the only content Google would pick up would be an error message. Either way, they're not worthwhile pages to have indexed.

While ranking highly in SEO is an important goal, you must think about the quality and quantity of the pages on your website.

When you submit a website to be crawled, the crawler is going to analyze every page on your website. Google's job is to assess the quality of a website by reviewing all the pages it has access to. Giving Google access to pages that do not add value can be detrimental to your rank and hurt your ability to appear in search results.

It's also important to manage the quantity of pages Google has access to and to limit the number of pages its crawlers scan. Google sets a 'crawl budget' for each site it scans, constrained by two factors.

The first is the crawl rate limit, which caps how fast Google fetches pages from a given site. It covers both the number of parallel connections the crawler uses and the time it waits between fetches.

The second factor is known as crawl demand. Even if the crawl rate limit isn't reached, the crawler will stay relatively idle when demand for indexing is low. A page's popularity is one signal of whether it will be crawled more frequently than others.

Because of this crawl budget, you want to tell Google which pages are the most crucial on your website. You don't want the crawler wasting its budget analyzing pages that don't drive traffic. That's where the power of a robots.txt file comes into play.

What does a Robots.txt file look like?

After reading all of this, the actual robots.txt file may be a bit underwhelming. At its core, a robots.txt file is just a simple list of URL paths that crawlers aren't allowed to visit. Typically, it will look a bit like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /author/
Disallow: /category/
Disallow: /tag/

There are two pieces to note here: the user agent and the disallow path.

The user agent line names the crawler the rules apply to. An asterisk (*) means that every crawler/search engine is blocked from analyzing the URL paths that follow. Some robots.txt files are more sophisticated and name specific crawlers to customize the rules per search engine, but that's not necessary for most websites.

Following the user agent is a series of lines starting with the word Disallow. Each of these represents a URL path that crawlers are blocked from analyzing. These aren't full URLs; each is the beginning of a path you don't want any search engine to crawl. Any URL whose path starts with one of these prefixes will not be crawled.
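
For example, here's a hypothetical file (the paths are made up for illustration) that sets a baseline rule for every crawler plus a stricter set of rules just for Google's crawler. One caveat: a crawler follows only the most specific group that names it, so Googlebot would read its own group and ignore the general one - any shared rules need to be repeated:

# Every crawler: stay out of the admin area
User-agent: *
Disallow: /wp-admin/

# Googlebot follows only this group, so the shared rule is repeated
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /drafts/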

Where can I find the robots.txt on my website?

Crawlers always look for the robots.txt file in one specific place: the root directory of your website.

This means that they look for it at the URL yourdomain.com/robots.txt; for example, ours can be found at www.centori.io/robots.txt. If the robots.txt file existed at a different URL, say yourdomain.com/index/robots.txt, it would not be found by crawlers (they're smart, but not that smart).

Don't worry - any content management system will set this up automatically, so your robots.txt file won't go missing. If you ever want to see your robots.txt file, just add '/robots.txt' after your home page URL and you'll be able to review it.

How to edit your Robots.txt Files on WordPress

To access the robots.txt file on WordPress, you will need to download a plugin. The simplest option is from Yoast.

You can download the free Yoast plugin at this link or in the plugin directory. We use WordPress for our own site, and while we love it, this is one thing to be aware of: your website won't have full SEO capabilities unless you extend it.

Once you've installed the plugin, you can edit your robots.txt file from an easy-to-use interface. First, head to Yoast in the sidebar menu and select Tools.

From there you'll be brought to the Yoast tools list; select File Editor.

And voila! You've got your robots.txt file set up, which you can edit with ease.

Now, what do we add to this file?

Assume for a moment that you want crawlers to scan everything on your WordPress site except the administrator pages. Here is what you would enter:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

This example would tell Googlebot (and other crawlers that respect robots.txt) to avoid the often-sensitive content behind your website, allowing the crawler to focus on what your audience sees.
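
And if, like us, your WordPress site auto-generates tag and category listing pages that add little value on their own, you could extend the file like this (a sketch - adjust the paths to match your own permalink structure):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
# Auto-generated listing pages with little standalone value
Disallow: /category/
Disallow: /tag/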

How to edit your Robots.txt Files on Squarespace

Where WordPress is entirely 'plug and play', Squarespace takes the opposite stance - they set up the robots.txt file for you.

So the short answer is "you can't", but let's unpack that a bit.

Squarespace does not allow you to manage your own website's robots.txt file; they set up a standard one for all websites built on their platform. It automatically asks Google not to crawl certain pages because they are for internal use only or display duplicate content.

You can see a full list of what they allow and disallow for site crawling at this sample robots.txt file. If customization is your jam, Squarespace may not be the CMS for you; for nontechnical folks, though, this provides an excellent solution to a tricky problem.

How to edit your Robots.txt Files on Wix

Wix is similar to Squarespace in that they also do not allow you to edit the robots.txt file for your website.

Wix automatically prevents the administrator pages from being crawled, because there's no benefit to them being read by search engines. Like Squarespace, they've got you covered there.

You can work around this and hide a page from search engines by adding a 'noindex' tag to an individual page, which will hide it from search results.
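
For reference, a 'noindex' tag is just a standard meta tag placed in the page's <head> - the same tag you'll see in the Shopify section below:

<meta name="robots" content="noindex">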

How to edit your Robots.txt Files on HubSpot

HubSpot gives full access to the robots.txt file, giving you complete customization in a pretty user-friendly interface.

Simply head to settings by clicking the gear icon in the main menu, then select Website and Pages from the sidebar menu. Choose the domain you wish to modify (if you have multiple), then head to SEO & Crawlers to open the robots.txt editor.

Add your rules, click Save, and you're done!

How to edit your Robots.txt Files on Webflow

Like HubSpot, Webflow affords complete access to your robots.txt file in a simple interface.

Head to Project Settings, then SEO and Indexing, which will bring you to the robots.txt editor. Add your rules, save the changes, and you're done! Pretty easy, right?

How to edit your Robots.txt Files on Shopify

Like other platforms that lean towards simplicity, Shopify does not permit you to edit your robots.txt file directly.

However, like Wix, the best workaround is to add a 'noindex' tag to the pages you do not want indexed by Google.

Shopify advises users to add this code snippet to the <head> section of a page (typically via your theme.liquid layout file) if they want it hidden:

{% if handle contains 'page-handle-you-want-to-exclude' %}
<meta name="robots" content="noindex">
{% endif %}

Or if you want to exclude the search results template you can add this:

{% if template contains 'search' %}
<meta name="robots" content="noindex">
{% endif %}

Going Forward

Hopefully this helps you become an SEO and crawling master!

Master might feel like a strong word right now, but getting your robots.txt file set up and optimized is a giant leap toward putting your website in the best position possible to rank.
