Website Design by Rainbo Design

Preventing Web Pages from Being Indexed with HTTPS

One of the canonicalization problems that can harm a website's rankings arises when Google or another search engine begins to index your pages with HTTPS, the secure protocol, instead of the normal HTTP. This can happen when Google discovers a link to your site that includes the "https://" prefix in the URL, whether that link resides on your own site or was posted by someone else, either accidentally or maliciously. When this happens, it can lead to duplicate content issues, because Google will usually see identical content on your site under both versions of the URL. This article discusses the causes and the cure.


Blocking & Removing Pages Indexed with HTTPS

This problem begins when Google encounters a link to your site that uses HTTPS. This most often occurs on e-commerce sites, but any site that deals with private user information may well protect portions of the site with SSL (Secure Sockets Layer), the encryption service used by HTTPS. Ordinarily, a webmaster will block the search engines from accessing these protected areas through an instruction in the site's robots.txt file. But if you don't use robots.txt, or if the instruction is poorly crafted, the gates are open for search engines to crawl those pages as they would any other.

Once Google or another search engine crawls a page with HTTPS, it can begin to crawl the rest of the site with HTTPS if you rely on relative links on your pages. A relative link is one that uses a shorthand version of the URL in the "href" attribute of the <a> (anchor) tag, and includes neither the protocol ("http://" or "https://") nor the domain name ("www.example.com"). To construct the complete URL for such a link, both search engines and browsers use the same protocol they used to access the page where the link resides. So, if your site relies heavily on relative links and does not take steps to prevent search engines from indexing pages with HTTPS, the problem can cascade through your entire website and disrupt your site's performance in the rankings. Here are some steps you can take to prevent this problem from hurting your website:

  1. Start by creating a good robots.txt file. It's a good idea to limit HTTPS access to specific directories within your site so that you can control when and where HTTPS is used. Then you can include an instruction in your robots.txt file to block the search engines from crawling those directories with something like:

    User-agent: *
    Disallow: /directory/

    There's a tool in Google's Webmaster Tools console that will let you test your robots.txt file to make sure that you are properly blocking all of the pages within the directories you want to protect with HTTPS. Naturally, this is in the "an ounce of prevention" category. You'll need to take further steps if some of your pages are already improperly indexed in the search engines.

  2. Use a robots <META> Tag on All Pages Using HTTPS. Using a robots <META> tag on pages designed to be accessed with HTTPS will go a long way toward preventing this problem. Simply add:

    <meta name="robots" content="noindex,nofollow">

    to the <head> section of each of these pages. This prevents the search engines both from indexing the page where this tag resides and from following any links on that page. If any of your pages designed for HTTPS access have already been indexed, be sure to add this <META> tag to all such pages and then temporarily remove the blocking instruction from your robots.txt file. This allows the search engines to crawl the pages and see the <META> tag, which will cause them to remove the pages from the index. Once the pages have been removed, you can restore the blocking instruction in your robots.txt file.

  3. Use the rel="canonical" Tag. The rel="canonical" tag, a <link> element placed in the page's <head> section, tells the search engines the correct URL for a page. It's always a good idea to add this tag to your site's main page to prevent the common canonicalization problems with the "www." prefix, and it will also serve to prevent that page from being indexed with HTTPS. You can use this tag in many situations where a page might be accessed with different URLs, and you can also use it when a page has already been improperly indexed with HTTPS. For details on the rel="canonical" tag, see Google's article Specify Your Canonical. This is both an "ounce of prevention" and a "pound of cure" that's easy to implement and does the job pretty well in a single step. You'll find another step you can take to reinforce this setting later in this article.

  4. Use Complete URLs in Your Links. Web designers like to use relative URLs when they create webpages because it simplifies testing page layouts on their computer before uploading them to the server. But, as we've seen, this can lead search engines to follow improper paths through the site once they've latched on to a link that resolves to a URL starting with "https://". Get in the habit of using complete URLs and you'll be doing your site a big favor.

  5. Use a Special robots.txt File For HTTPS. You can serve a special robots.txt file when your server receives a request for /robots.txt using HTTPS. If your server uses Apache server software, you can add an instruction in your .htaccess file to handle this, such as:

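    # Serve robots_https.txt whenever /robots.txt is requested over HTTPS.
    # RewriteEngine On is needed once per .htaccess file.
    RewriteEngine On
    RewriteCond %{HTTPS} ^on$
    RewriteCond %{REQUEST_URI} ^/robots\.txt$
    RewriteRule ^(.*)$ /robots_https.txt [L]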


    You may need a different instruction, depending on your server environment. If the above example doesn't work for you, try:

    # Same rule, but detecting HTTPS by port (any port number not containing "80").
    RewriteCond %{SERVER_PORT} !80
    RewriteCond %{REQUEST_URI} ^/robots\.txt$
    RewriteRule ^(.*)$ /robots_https.txt [L]

    Next, create a special robots.txt file using a different file name. In my example, I use "robots_https.txt". Modify the .htaccess code above to use whatever file name you choose, then fill it with:

    User-agent: *
    Disallow: /


    This will block the search engines from using HTTPS for any URL on your site. If your server uses Microsoft IIS software, contact your hosting service for advice on implementing this.

  6. Redirect HTTPS Requests For Normal Pages. If some of your pages have already been improperly indexed with HTTPS, it's a good idea to set up 301 redirects for those pages and unblock them in your special robots.txt file (if any) so that the search engines can try to re-crawl those pages and discover the new redirect. A sample .htaccess instruction for this would be:

    # Redirect HTTPS requests to HTTP, except inside the directory where HTTPS is allowed.
    RewriteCond %{HTTPS} ^on$
    RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]


    or, again, if your server needs a different test to detect HTTPS:

    # Same redirect, but detecting HTTPS by port (any port number not containing "80").
    RewriteCond %{SERVER_PORT} !80
    RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]


    Note that this example allows HTTPS access to one directory, but you may need to allow access to individual pages rather than an entire directory. Once the search engines have seen this redirect a few times, you should go ahead and restore the blocking instructions in your special robots.txt file. Letting the search engines see the 301 redirects on URLs that have been indexed with HTTPS will effectively remove them from the index.
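
Step 4 and the cascade described before the list both come down to how a relative link is resolved. As a rough illustration only, the following Python sketch uses the standard library's urljoin function to resolve the same link from an HTTP copy and an HTTPS copy of a page. The page paths and the "contact.html" link are hypothetical examples, not taken from any real site.

```python
from urllib.parse import urljoin

# Hypothetical page URLs on the placeholder domain used in this article.
http_page = "http://www.example.com/products/index.html"
https_page = "https://www.example.com/products/index.html"

relative_href = "../contact.html"                      # no protocol, no domain name
absolute_href = "http://www.example.com/contact.html"  # complete URL

# A relative link inherits the protocol of the page it was found on.
from_http = urljoin(http_page, relative_href)
from_https = urljoin(https_page, relative_href)

# A complete URL resolves the same way however the page was reached.
absolute_from_https = urljoin(https_page, absolute_href)

print(from_http)
print(from_https)
print(absolute_from_https)
```

The second result keeps the "https://" prefix, which is how one stray HTTPS link can spread to every page reachable through relative links, while the complete URL stays on "http://".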

In summary, removing pages that have been improperly indexed with HTTPS requires a bit of effort. The rel="canonical" tag is the easiest method for removing your normal pages that were indexed with HTTPS, but it can take a long time for the search engines to act on it. Always using the robots <META> tag set to "noindex" on pages that you never want indexed will go a long way toward preventing the problem as well. Serving a special robots.txt file adds a layer of prevention and will, in time, also repair the problem. The ultimate sledgehammer approach is to also install 301 redirects for directories or individual pages that have been improperly indexed.
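
Before relying on the robots.txt rules from steps 1 and 5, it can help to try them offline. The sketch below uses Python's built-in urllib.robotparser module. It is only an approximation, since each search engine applies its own parser, and the "checkout.html" page and the "Googlebot" user-agent string are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Normal robots.txt (step 1): block only the directory reserved for HTTPS.
http_rules = build_parser("User-agent: *\nDisallow: /directory/\n")

# Special robots.txt served for HTTPS requests (step 5): block everything.
https_rules = build_parser("User-agent: *\nDisallow: /\n")

# Hypothetical URLs on the placeholder domain used in this article.
secure_blocked = not http_rules.can_fetch(
    "Googlebot", "http://www.example.com/directory/checkout.html")
home_allowed = http_rules.can_fetch(
    "Googlebot", "http://www.example.com/index.html")
https_home_blocked = not https_rules.can_fetch(
    "Googlebot", "https://www.example.com/index.html")

print(secure_blocked, home_allowed, https_home_blocked)
```

If any of the three values comes out False, the corresponding rule is not doing what the steps above intend.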

These steps will help reduce the risk of your site developing duplicate content or canonicalization problems, and can also remove pages from the search engines' indexes that have been improperly indexed with "https://". For your "pound of cure," it's critical to let the search engines see the new status of the badly indexed URLs; they will not remove them from the index until they do.
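
One way to confirm what a search engine will see on a page is to read the tags from steps 2 and 3 back out of its HTML. The following Python sketch, using only the standard html.parser module, is an illustration under assumptions: the IndexingHints class, the indexing_hints helper and the two sample pages are hypothetical, and in practice you would feed it HTML fetched from your own site.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collect the robots <meta> value and the rel="canonical" URL of a page."""

    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.robots = attributes.get("content")
        elif tag == "link" and "canonical" in (attributes.get("rel") or "").lower().split():
            self.canonical = attributes.get("href")


def indexing_hints(html):
    """Return (robots content, canonical URL); either may be None if absent."""
    parser = IndexingHints()
    parser.feed(html)
    return parser.robots, parser.canonical


# Hypothetical sample pages: one meant for HTTPS only, one normal home page.
secure_page = ('<html><head><meta name="robots" content="noindex,nofollow">'
               '</head><body></body></html>')
home_page = ('<html><head><link rel="canonical" href="http://www.example.com/">'
             '</head><body></body></html>')

secure_hints = indexing_hints(secure_page)
home_hints = indexing_hints(home_page)

print(secure_hints)
print(home_hints)
```

A page intended only for HTTPS should report "noindex,nofollow", and a normal page should report a canonical URL that begins with "http://".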


This SEO Tip was last updated on March 30, 2012




Call Richard L. Trethewey at Rainbo Design in Minneapolis today at 612-408-4057 from 9:00 AM to 5:00 PM Central time
to get started on your affordable website design package or search engine optimization program today!



