Controlling When Your Website Is Indexed with HTTPS

Among the canonicalization problems that can harm a website's rankings is when Google or one of the other search engines begins to index your pages with HTTPS, the secure protocol, instead of the normal HTTP. This can happen when Google discovers a link to your site that includes the HTTPS prefix in the URL, whether that link resides on your own site or was posted by someone else - either accidentally or maliciously. When this happens, it can lead to Duplicate Content issues because Google will usually see identical content on your site with both versions of the URL. This article discusses the causes and the cure.

Blocking & Removing Pages Indexed with HTTPS

This problem begins when Google encounters a link to your site using HTTPS. This most often occurs on e-commerce sites, but any site that deals with private user information may well protect portions of its pages with SSL (Secure Sockets Layer) encryption, which is what HTTPS provides. Ordinarily, a webmaster will block the search engines from accessing these protected areas through an instruction in the site's robots.txt file. But if you don't use robots.txt, or if the instruction is poorly crafted, the gates are open for search engines to crawl those pages as they would any other.

Once Google or another search engine crawls a page with HTTPS, it can begin to crawl the rest of the site with HTTPS if you rely on relative links on your pages. A relative link is one that uses a shorthand version of the URL in the "href" attribute of the <a> (anchor) tag, and does not include either the protocol ('HTTP' or 'HTTPS') or the domain name ('www.example.com'). To construct the complete URL for such a link, both search engines and browsers reuse the protocol and domain of the page where the link resides. So, if your site relies heavily on relative links and does not take steps to prevent search engines from indexing pages with HTTPS, the problem can cascade through your entire website and disrupt your site's performance in the rankings. Here are some steps you can take to prevent this problem from hurting your website:

Start by creating a good robots.txt file. It's a good idea to limit HTTPS access to specific directories within your site so that you can control when and where HTTPS is used. Then you can include an instruction in your robots.txt file to block the search engines from crawling those directories with something like:

User-agent: *
Disallow: /directory/

There's a tool in Google Search Console (formerly Google Webmaster Tools) that will let you test your robots.txt file to make sure that you are properly blocking all of the pages within the directories you want to protect with HTTPS. Naturally, doing this falls in the "an ounce of prevention" category. You'll need to take further steps if some of your pages are already improperly indexed in the search engines.

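As a rough local sanity check before uploading, a rule like the one above can also be exercised with Python's standard urllib.robotparser module. This is only a sketch: the example.com URLs and the "/directory/" path are placeholders, the helper name is made up for illustration, and Python's parser is not guaranteed to interpret every rule exactly as a given search engine does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; "/directory/" stands in for your HTTPS-only area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if ROBOTS_TXT allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "https://www.example.com/directory/checkout.html",
    ):
        status = "crawlable" if is_crawlable(url) else "blocked"
        print(url, "->", status)
```

A check like this only confirms that the rule matches the paths you expect; it does not replace testing the live file in the search engine's own tool.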
Use a robots <META> Tag on All Pages Using HTTPS. Using a robots <META> tag on pages designed to be accessed with HTTPS will go a long way toward preventing this problem. Simply add:

<meta name="robots" content="noindex,nofollow">

to the <head> section of each of these pages. This prevents the search engines both from indexing the page where this tag resides and from following any links on that page. If any of your pages designed for HTTPS access have already been indexed, be sure to add this <META> tag to all such pages and then temporarily remove the blocking instruction from your robots.txt file. A crawler that is blocked by robots.txt never fetches the page, so it never sees the tag; lifting the block lets the search engines read the <META> tag, which will cause them to remove the page from the index. Once the pages have been removed, you can restore the blocking instruction in your robots.txt file.

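To confirm that a page actually carries the tag, something like the following Python sketch can scan a page's HTML for a robots <META> tag. The class and function names and the sample page are made up for illustration; in practice you would feed it the HTML of your own HTTPS pages.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Split "noindex,nofollow" into individual lowercase directives.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives


# Hypothetical page used only to demonstrate the check.
SAMPLE_PAGE = """<html><head>
<title>Checkout</title>
<meta name="robots" content="noindex,nofollow">
</head><body>Secure checkout</body></html>"""

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE_PAGE)))
```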
Use the rel="canonical" Tag. The rel="canonical" tag tells the search engines the correct URL for a page. It's always a good idea to add this tag to your site's main page to prevent the common canonicalization problems with the "www." prefix, and it will also help prevent that page from being indexed with HTTPS. You can use this tag in many situations where a page might be accessed with different URLs, and you can also use it when a page has already been improperly indexed with HTTPS. For details on the rel="canonical" tag, see Google's article Specify Your Canonical. This is both an "ounce of prevention" and a "pound of cure" that's easy to implement and does the job pretty well in a single step. You'll find another step you can take to reinforce this setting later in this article.

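For sites whose pages are generated by a script, the canonical tag can be built so that it always points at one preferred protocol and host, whatever URL was requested. The Python sketch below assumes the preferred form is "http://www.example.com", matching the examples in this article; the function name and constants are hypothetical, and keeping the query string in the canonical URL is a choice you may not want for every page.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Assumed preferred protocol and host; substitute your own.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"


def canonical_link_tag(requested_url):
    """Build a rel="canonical" tag that ignores the requested scheme and host."""
    parts = urlsplit(requested_url)
    canonical_url = urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )
    # Escape the URL so characters such as "&" are valid inside the attribute.
    return '<link rel="canonical" href="%s">' % escape(canonical_url, quote=True)


if __name__ == "__main__":
    print(canonical_link_tag("https://example.com/"))
    print(canonical_link_tag("https://www.example.com/page.html"))
```

The resulting tag belongs in the <head> section of the page, alongside any robots <META> tag.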
Use Complete URLs in Your Internal Links. Web designers like to use relative URLs when they create webpages because it often simplifies testing page layouts on their computer before uploading them to the server. But, as we've seen, this can lead to search engines following improper paths through the site once they've latched on to a link that resolves to a URL starting with "https://". Get in the habit of using complete URLs and you'll be doing your site a big favor.

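The way a relative link inherits the protocol of the page it sits on can be seen with Python's standard urljoin function, which follows the same URL resolution rules that browsers and crawlers apply. The page and link URLs below are placeholders, and the resolve_link wrapper is only there to name the two inputs.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Return the complete URL a crawler would request for href found on page_url."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    relative_href = "/products/widget.html"
    absolute_href = "http://www.example.com/products/widget.html"

    # The same relative link resolves differently depending on the page's protocol.
    print(resolve_link("http://www.example.com/index.html", relative_href))
    print(resolve_link("https://www.example.com/cart/checkout.html", relative_href))

    # A complete URL keeps its own protocol, even on an HTTPS page.
    print(resolve_link("https://www.example.com/cart/checkout.html", absolute_href))
```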
Use a Special robots.txt File For HTTPS. You can serve a special robots.txt file when your server receives a request for /robots.txt using HTTPS. If your server uses Apache server software, you can add an instruction near the top of your .htaccess file to handle this, such as:

# Serve the HTTPS-only robots file when /robots.txt is requested over HTTPS
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]

You may need a different instruction, depending on your server environment. If the above example doesn't work for you, try:

# Same rule, keyed on the server port instead of the HTTPS variable
RewriteCond %{SERVER_PORT} !^80$
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]

This instruction should be placed before any other redirects in your .htaccess file so that it will be processed first.

Next, create a special robots.txt file using a different file name. In my example, I use "robots_https.txt". Modify the "RewriteRule" in the .htaccess code above to use whatever file name you choose. Then, create a new text file using that file name, and fill it with:

User-agent: *
Disallow: /

The combination of the .htaccess settings and the special robots.txt file will block the search engines from using HTTPS for any URL on your site. If your server uses Microsoft IIS software, contact your hosting service for advice on implementing this.

Redirect HTTPS Requests For Normal Pages. If some of your pages have already been improperly indexed with HTTPS, it's a good idea to set up 301 redirects for those pages and unblock them in your special robots.txt file (if any) so that the search engines can try to re-crawl those pages and discover the new redirect. A sample .htaccess instruction for this would be:

RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

or, (again) if your server uses a different HTTPS indicator field:

RewriteCond %{SERVER_PORT} !^80$
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Note that this example allows HTTPS access to one directory (such as the "admin" directory for a blog or an e-commerce website). If other directories also depend on HTTPS access, add a similar negated RewriteCond line for each of them above the RewriteRule. Once the search engines have seen this redirect a few times, you should go ahead and restore the blocking instructions in your special robots.txt file. Letting the search engines see the 301 response from URLs that have been indexed with HTTPS will effectively remove them from the index.
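Before relying on the redirect, it can help to write down which URLs you expect it to catch. The Python sketch below only mirrors the intended decision of the sample rule (HTTPS request, not inside an allowed directory, redirect to the HTTP URL); it does not run Apache or parse .htaccess, and the function name, the constants, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit

# Assumed values taken from the sample rule; adjust to your own site.
CANONICAL_ORIGIN = "http://www.example.com"
HTTPS_ALLOWED_PREFIXES = ("/https-allowed-directory/",)


def redirect_target(url):
    """Return the URL a 301 should point to, or None if no redirect applies."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return None  # Plain HTTP requests are left alone.
    if parts.path.startswith(HTTPS_ALLOWED_PREFIXES):
        return None  # This directory is meant to be served over HTTPS.
    target = CANONICAL_ORIGIN + (parts.path or "/")
    if parts.query:
        target += "?" + parts.query  # Keep the query string on the redirect.
    return target


if __name__ == "__main__":
    for url in (
        "https://www.example.com/page.html",
        "https://www.example.com/https-allowed-directory/login",
        "http://www.example.com/page.html",
    ):
        print(url, "->", redirect_target(url))
```

Comparing a table like this against what your server really returns for each URL is a quick way to spot a directory you forgot to exempt.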

Changing Your Website to Use All HTTPS URLs

In August 2014, Google announced that it would give a slight ranking boost to sites whose pages are served using HTTPS. This has prompted many webmasters to make the switch, but you need to use the same care described here when switching protocols TO HTTPS as when you want to REMOVE HTTPS. That means using the rel="canonical" tag on your pages, carefully crafting 301 redirects, and updating your robots.txt file to make sure that your sensitive pages are never indexed.

Your .htaccess file should include one of the two following instructions:

RewriteCond %{HTTPS} ^off$
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

or, (again) if your server uses a different HTTPS indicator field:

RewriteCond %{SERVER_PORT} ^80$
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

When you change your website to use HTTPS, it's important to notify the search engines directly about the change through Google Search Console (formerly Google Webmaster Tools) and Bing's Webmaster Tools to speed up their indexing of your new URLs. See my article on Changing Your URLs for more information.

Summary

In summary, removing pages that have been improperly indexed with HTTPS requires a bit of effort. The rel="canonical" tag is the easiest method for removing pages that were indexed with HTTPS from the search engines, but it can take a long time for them to resolve the situation. Always using the robots <META> tag set to "noindex" on pages that you never want to be indexed will go a long way toward preventing the problem as well. Serving a special robots.txt file is an added layer of prevention and will, in time, also repair the problem. The ultimate sledgehammer approach is to also install 301 redirects for directories or individual pages that have been improperly indexed.

These steps will help reduce the risk of your site developing duplicate content or canonicalization problems, and can also remove pages that have been improperly indexed with "https://" from the search engines' indexes. It's critical to your "pound of cure" that you let the search engines see the new status of the badly indexed URLs; only then will they remove them from the index.

This SEO Tip was last updated on September 25, 2020
