Robots.txt and noindex tags
Have you ever wondered how you can prevent Google from indexing a page that you don’t want appearing in search results? Or have you landed on a page that you don’t understand the purpose of? As a website owner, you have control over which parts of your site should be visible to search engines and which should be kept in the background. Two important tools for this are the robots.txt file and the noindex tag.
You can think of robots.txt as a bouncer at the entrance of your website. It tells search engine robots which rooms they are allowed to enter and which doors should remain closed. The noindex tag works more like a small sign on a specific door that says:
“You’re welcome to take a look, but don’t include this room in the catalog.”
To use them correctly, you need to understand the difference between blocking crawling and preventing indexing. Let’s go through both steps.
What is robots.txt and how do you use it?
Robots.txt is a simple text file that is always placed in the root directory of the website. If your site is located at www.example.com, you will find the file at www.example.com/robots.txt. It serves as an instruction to search engine robots about which parts of the site they should crawl and which they should skip.
In the file, you use directives, with the most common ones being:
- User-agent: specifies which robot the instruction applies to. "User-agent: *" means that all search engines should follow the rule.
- Disallow: specifies which folder or page should not be crawled.
An example of a robots.txt file:
User-agent: *
Disallow: /internal/
Disallow: /test.html
Here, you are telling all robots that they are not allowed to crawl the folder “internal” or the file “test.html.”
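If you ever need to open up a single file inside a folder that is otherwise blocked, Google also supports an Allow directive. A small sketch, where the folder and file names are just placeholders for illustration:

User-agent: *
Disallow: /internal/
Allow: /internal/press-kit.pdf

Here, everything under "internal" stays off limits except the press kit, because Google follows the most specific matching rule.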
It’s important to understand that robots.txt only controls crawling, not indexing. If another website links to a page that you’ve blocked, Google can still index it, but without seeing the content. In that case, the page may appear in search results as a link without a description, which rarely looks good.
🔍 Ask yourself:
Are there folders or files on your site that Google really doesn’t need to crawl? For example, admin panels, internal test areas, or files that don’t provide value to the user.
What is noindex?
The noindex tag is a more precise tool. It is placed in the <head> section of the HTML code on a specific page and tells search engines:
“This page should not appear in search results.”
An example of how a noindex tag looks:
<meta name="robots" content="noindex">
Don't block a noindexed page in robots.txt
The key here is that the search engine must be able to visit the page first in order to see the tag. If you block the page in robots.txt, Google will never access the code and won’t know it should be excluded. Therefore, noindex is the most reliable way to keep pages out of search results.
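To make it concrete, here is a minimal sketch of where the tag sits in a page's HTML. The page itself is hypothetical; the only line that matters is the meta tag inside <head>:

<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <!-- Tells search engines not to include this page in their index -->
    <meta name="robots" content="noindex">
  </head>
  <body>
    Page content that users can still visit as usual.
  </body>
</html>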
Typical examples of when you would want to use noindex are:
- A “thank you for your purchase” page, which is only relevant to those who have already made a purchase
- Login pages or internal dashboards
- Duplicate or filtered pages in an online store, such as sorting or parameter variations, that you don't want competing with your main products (see the example after this list)
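For the duplicate and filter pages in the last point, a common pattern is to combine noindex with follow, which keeps the page out of the index while still letting Google follow its links to your main products. A hedged example:

<meta name="robots" content="noindex, follow">

In practice, follow is already the default, so a plain noindex behaves the same way; spelling it out simply makes the intent explicit.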
🔍 Ask yourself:
Are there pages on your site that are necessary for users but shouldn’t appear on Google?
When should you use which?
Now that you understand the basics, the next step is knowing when to use robots.txt and when noindex is the right way.
🔵 Use robots.txt when you want to save Google’s crawl budget and prevent unnecessary files or folders from being read. This may include logs, internal search results pages, or large files that don’t need to be crawled (see the example below).
🔵 Use noindex when you want a page to be accessible to users and robots but not appear in search results. It’s the most reliable way to keep content out of the index.
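To illustrate the crawl budget case, a robots.txt that keeps robots out of internal search results and log files could look roughly like this; the paths are assumptions, so adjust them to how your own site is actually structured:

User-agent: *
Disallow: /search/
Disallow: /logs/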
A common mistake is accidentally blocking important pages in robots.txt, such as the entire /blog/ folder. In that case, Google can’t even see the content, and you lose valuable visibility. Another pitfall is adding noindex to the wrong pages, like important landing pages, causing them to disappear from search results.
If you’re unsure, don’t edit your robots.txt file on your own. Contact us if you need help.
🔍 Think about:
Which pages on your website are important for Google to see, and which can you safely exclude?
Practical tips to avoid mistakes
To succeed with robots.txt and noindex, it’s good to work methodically. Some tips:
- Always create a backup of your robots.txt before making changes
- Test the file with the robots.txt report in Google Search Console
- Use noindex instead of robots.txt when you want to exclude a page from search results
- Double-check that important pages like the homepage, product pages, and blog posts are not accidentally blocked
- Consider combining with a sitemap so that Google has a complete overview of the pages you actually want to show (see the example after this list)
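The sitemap reference from the last tip can live directly in your robots.txt file. A minimal sketch, assuming the sitemap sits at the root of the example domain:

User-agent: *
Disallow: /internal/
Sitemap: https://www.example.com/sitemap.xml

The Sitemap line is independent of the User-agent groups, so it only needs to appear once in the file.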
In summary
Robots.txt and noindex are two powerful but different tools. Robots.txt acts like a bouncer, controlling where Google can go, while noindex is the sign on the door telling a page not to appear in search results.
By understanding the difference and using them correctly, you can avoid common mistakes, save resources, and ensure that only the right pages appear in Google’s index.
The next step is clear ➡️ Review your own website, create or update your robots.txt file, and use noindex where it makes sense. This way, you build a clear structure for both users and search engines, increasing the chances that your most important pages reach the right audience.