
When robots.txt prevents Google from forgetting

Magnus Bråth

With robots.txt you can stop Google from crawling certain pages on your site, which is why it’s a file many SEO professionals keep a close eye on. Sometimes, however, that very same setting can accidentally do the opposite and keep pages stuck in Google’s index, and that’s one of the issues Aaron highlights in the latest SEO Clinic.

When tweaking your robots.txt, you really need to know what you’re doing. If you block Google from reading an already indexed page, it won’t disappear from the index. Stigasports.com happened to make exactly this mistake, and it’s one of the problems Aaron Axelsson addresses in our latest Live SEO session on YouTube. Here’s the video; it may be worth watching before you read on, if you have the time.

What happened is that Google, one way or another, found links to pages on the site that were blocked in robots.txt. To understand what happens then, we need to understand what robots.txt actually does and how it differs from, for example, noindex.

What is robots.txt?

robots.txt is a plain text file placed at the root of your site, at yoursite.com/robots.txt. In it you can specify where your sitemap is found and which paths crawlers should not visit. Google and most other crawlers interpret this literally and simply don’t visit those pages: they don’t read the content, check the headers, or anything like that.
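As a minimal sketch (the paths and sitemap URL below are placeholders, not taken from any real site), a robots.txt might look like this:

    # Applies to all crawlers
    User-agent: *
    # Paths crawlers should not visit
    Disallow: /internal-search/
    Disallow: /filters/

    # Where to find the sitemap
    Sitemap: https://yoursite.com/sitemap.xml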

Unlike setting <meta name="robots" content="noindex"> on a page, which still allows Googlebot to visit but prevents the page from being saved in Google’s index, blocking in robots.txt means Googlebot won’t visit the page at all. The difference may seem small, but it can lead to problems, as in Stiga’s case.
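For comparison, a noindex is delivered on the page itself, either in the HTML head:

    <meta name="robots" content="noindex">

or, for non-HTML resources, as an HTTP response header:

    X-Robots-Tag: noindex

Either way, Googlebot has to be able to crawl the page to see the directive, which is exactly why combining noindex with a robots.txt block doesn’t work.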

What happened on Stigasports.com?

Here’s what happened on Stigasports.com: on one or more pages, there were (or still are) links pointing to pages blocked in robots.txt. It’s not certain that those pages were always blocked; the block may have been a deliberate SEO decision, for example to prevent the site’s filters from creating too many duplicates. The problem is that once you block these pages in robots.txt, Google stops visiting them completely, but they don’t get removed from the index. Instead, the pages remain but look essentially empty in Google’s eyes. Rather than showing an old cached result, Google displays: “No information is available for this page.”

This issue is somewhat similar to having lots of duplicate content. It may not consume crawl budget (the amount of time and resources Google is willing to spend crawling your site) in the same way duplicates do, but it still lowers trust in your site.

What’s the solution?

The solution to pages showing “No information is available for this page” is almost always to let Google back in by removing the relevant line in robots.txt and instead delivering either a 301 redirect, a canonical tag, or a noindex. Each has slightly different effects, and which one is right depends on the overall situation; more on that in another blog post.
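To make the three options concrete, here are generic sketches; the paths and URLs are placeholders, and the redirect assumes an Apache setup, not Stiga’s actual configuration. A 301 redirect in .htaccess:

    Redirect 301 /old-filter-page/ https://yoursite.com/category/

A canonical tag in the head of the blocked page, pointing at the version you want indexed:

    <link rel="canonical" href="https://yoursite.com/category/">

Or a noindex, as shown earlier:

    <meta name="robots" content="noindex">

None of these have any effect until the corresponding Disallow line is removed from robots.txt, since Googlebot must be able to fetch the page again to see them.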

Magnus Bråth, Consultant & Adviser

Magnus is one of the world's most prominent search marketing specialists and primarily works with management and strategy at his agency Brath AB.