SEO Indexation Control Fundamentals

Earlier this week Google announced that, effective from 1st September, it will no longer observe unsupported and unpublished rules in robots.txt files. One notable example is that Google will no longer support the inclusion of the noindex directive within a robots.txt file. Google advised that the decision to discontinue support for such unsupported and unpublished rules was taken to maintain a “healthy ecosystem” and to prepare for potential future open source releases, alongside the open-sourcing of Google’s production robots.txt parser.

In this post I’m going to explore the alternative indexation control methods available and, more crucially, which methods are best suited to specific scenarios.

What is indexation control?

Indexation control refers to the process of managing how search engine crawlers access a website and whether its content is indexed. There are numerous scenarios which may require you to discourage search engine crawlers from crawling or indexing specific website content, and the exact scenario will determine which indexation control method is best suited to the requirements.

Indexation control methods

Robots.txt

Robots.txt is a plain text file, placed at the root of a domain, which instructs search engine crawlers how they may crawl a website.

Example syntax:

The below syntax prevents all user agents from accessing any website content.

User-agent: *
Disallow: /

While this syntax would prevent only Googlebot from accessing the /admin/ directory:

User-agent: googlebot
Disallow: /admin/
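
Google also supports an Allow rule, which can be combined with Disallow to carve out exceptions to a blocked directory. As an illustrative sketch (the /private/ directory and overview.html page below are hypothetical), the following blocks a directory for all crawlers while still permitting a single page within it:

User-agent: *
Disallow: /private/
Allow: /private/overview.html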


Meta Robots

The robots meta tag allows a more granular, page level approach to controlling how an individual page should be indexed and served to users in search results. The tag should be inserted directly into the <head> section of the given page where indexation control is required.

Example syntax:

The below syntax instructs all crawlers to noindex a specific page:

<meta name="robots" content="noindex">
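
Directives can also be combined, or aimed at a specific crawler by naming it in the meta tag. The sketch below shows both (whether nofollow is also appropriate depends on the use case):

<meta name="robots" content="noindex, nofollow">
<meta name="googlebot" content="noindex">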


X-Robots-Tag

The X-Robots-Tag can be returned as part of the HTTP header response for a given URL. Any indexation directive which can be utilised in a robots meta tag can also be specified as an X-Robots-Tag.

Example syntax:

The below X-Robots syntax again instructs all crawlers to noindex a specific page:

HTTP/1.1 200 OK
Date: Tue, 05 May 2019 20:05:20 GMT
(…)
X-Robots-Tag: noindex
(…)
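
Because the X-Robots-Tag is applied at the HTTP level, it is particularly useful for non-HTML resources such as PDFs, where a robots meta tag cannot be placed. A minimal sketch, assuming an Apache server with mod_headers enabled (the PDF file pattern is just an example):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>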


Rel canonical

A canonical tag is a method of informing search engines that a specific URL represents the primary instance of a page. As a result the canonical tag acts as an indexation control method, signalling to search engines which URL should be indexed (although Google treats it as a strong hint rather than a directive). In addition to specifying the preferred URL to show in search results, canonical links can also help achieve the following:

– Consolidate link signals for similar / duplicate pages
– Simplify and consolidate tracking metrics
– Help manage syndicated content
– Help optimise crawl budgets (typically large sites only)

Example syntax:

The below canonical link informs search engines that https://chrismann.uk/technical/indexation-control-fundamentals/ is the primary URL instance of this particular blog post:

<link rel="canonical" href="https://chrismann.uk/technical/indexation-control-fundamentals/">
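
For non-HTML resources, a canonical can alternatively be declared via an HTTP Link header rather than in the markup, for example:

Link: <https://chrismann.uk/technical/indexation-control-fundamentals/>; rel="canonical"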


Password protection

Placing a web page behind a login will typically prevent it from being crawled by search engine crawlers and ultimately from showing within their results. The only exception to this rule is when specific markup is utilised to indicate subscription or paywalled content, for example flexible sampling.
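
As a minimal sketch, HTTP basic authentication can be enabled on an Apache server via an .htaccess file (the AuthUserFile path below is a placeholder):

AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user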


Google Search Console

The Remove URLs tool allows you to temporarily block pages from Google Search results, although it can only be used for sites which you own / manage via Google Search Console. It is also key to note that a temporary removal lasts for approximately 90 days and that the tool is currently only available in the old version of Search Console. Permanent removal can only be achieved via:

  • Removal of the content (404 / redirect)
  • Blocking access to the content (requiring a login)
  • Page level indexation protection (e.g. a noindex directive)


Indexation control scenarios

Staging site

Scenario:

You are preparing to launch a new site; for the time being the legacy site remains live, and external access to the staging environment of the new website is required.

Recommended indexation control:

Robots.txt disallow rule

or

Page level meta robots noindex instruction

or

Page level X-Robots noindex instruction

or

Password protection

Why?

A robots.txt disallow rule would typically be the preferred method where a high number of pages require indexation control, in order to help preserve crawl budget. In situations where direct access is required to only a limited number of pages, page level meta robots directives are likely to be a quicker and lower risk option.
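
For a staging environment, a site-wide X-Robots-Tag header can also be a convenient way to apply a noindex everywhere without editing individual page templates. A minimal sketch, assuming an Apache staging vhost with mod_headers enabled:

Header set X-Robots-Tag "noindex, nofollow"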

A further, and arguably more secure, option would be to implement password protection or IP restrictions to control access to the staging environment.


Staging site (accidental indexation)

Scenario:

If indexation protection is not implemented as detailed in the previous scenario, accidental indexation of staging content can occur. Such indexation can be harmful not only from a privacy perspective but also as a direct constraint on organic visibility. It’s key to note that the available indexation control options are notably reduced once accidental indexation has occurred.

Recommended indexation control:

Page level meta robots noindex tag

or

Page level X-Robots noindex instruction

and / or

Submission of temporary removal request via Google Search Console (valid for 3 months)

Why?

If indexation control is implemented via a robots.txt disallow rule then all respective pages requiring deindexation will remain indexed, albeit typically displayed without a meta description, because crawlers can no longer access the pages to see a noindex directive. To ensure all such pages are successfully deindexed, crawler access must be maintained, with indexation control added at page level via a meta robots noindex directive.

To help speed up the deindexation of such content, it is often beneficial, where practical, to submit a temporary removal request via Google Search Console.
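
In other words, while deindexation is pending the affected staging URLs must remain crawlable (robots.txt should not disallow them), with the noindex applied at page level. A simplified sketch combining the two pieces, for illustration only:

# robots.txt on the staging host: leave the URLs awaiting deindexation crawlable
User-agent: *
Disallow:

<!-- meanwhile, in the <head> of each staging page -->
<meta name="robots" content="noindex">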


Duplicate content

Scenario:

It’s common for sites to feature some degree of duplicate content. Duplicate content can often act as an SEO constraint because it causes confusion for search engines as to which version of the content to serve within their search results.

One example of where duplicate content can inadvertently occur is an eCommerce store with numerous product pages featuring almost identical content, e.g. variants of the same product in different colours.

Recommended indexation control:

Page level rel canonical

Why?

If crawling of such pages is disallowed via robots.txt (or their internal links are marked nofollow), search engines would be precluded from seeing the rel canonical instruction. While such content wouldn’t be indexed, if Google can’t see the canonical instruction then any links pointing to, say, the blue version rather than the default version would effectively pass no authority to the primary URL.
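
For instance (the product URLs below are hypothetical), each colour variant would carry a canonical pointing at the primary product page while remaining fully crawlable:

<!-- on https://www.example.com/product/widget-blue/ -->
<link rel="canonical" href="https://www.example.com/product/widget/">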


Search results pages

Scenario:

Google has reiterated many times that it actively discourages the inclusion of internal search result pages within its own search results. The exposure of such pages can have a negative SEO impact through wasted crawl budget and the dilution of contextual relevance around key content themes.

Recommended indexation control:

Robots.txt disallow rule

Why?

Disallowing access to all such result pages helps ensure crawl budget is preserved for the processing of high value pages.
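
As a sketch, assuming internal search results live under a /search/ path or use an s= query parameter (adjust the patterns to your own URL structure):

User-agent: *
Disallow: /search/
Disallow: /*?s=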