One of the canonicalization problems that can harm a website's rankings
arises when Google or another search engine begins to index your pages with HTTPS,
the secure protocol, instead of the normal HTTP. This can happen when Google discovers a link
to your site that includes the HTTPS prefix in the URL, whether that link resides on your own
site or was posted by someone else, either accidentally or maliciously. When this happens, it can lead to duplicate content
issues because Google will usually see identical content on your site under both versions
of the URL. This article discusses the causes and the cure.
This problem begins when Google encounters a link to your site using HTTPS. This most often occurs on e-commerce sites, but any site that deals with private user information may well protect portions of its pages with the SSL (Secure Sockets Layer) service employed via HTTPS. Ordinarily, a webmaster will block the search engines from accessing these protected areas through an instruction in the site's robots.txt file. But if you don't use robots.txt, or if the instruction is poorly crafted, the gates are open for search engines to crawl those pages as they would any other.
Once Google or another search engine crawls a page with HTTPS, it can begin to crawl the rest of the site with HTTPS if you rely on relative links on your pages. A relative link is one that uses a shorthand version of the URL in the "href" attribute of the <a>nchor tag, and does not include either the protocol ('HTTP' or 'HTTPS') or the domain name ('www.example.com'). When constructing the complete URL for such a link, both search engines and browsers use the same protocol they used to access the page where the link resides. So, if your site relies heavily on relative links and does not take steps to prevent search engines from indexing pages with HTTPS, the problem can cascade through your entire website and disrupt your site's performance in the rankings.
Here are some steps you can take to prevent this problem from hurting your website:
Start by creating a good robots.txt file. It's a good idea to
limit HTTPS access to specific directories within your site so that you can control
when and where HTTPS is used. Then you can include an instruction in your robots.txt
file to block the search engines from crawling those directories with something like:
User-agent: *
Disallow: /directory/
There's a tool in Google's
Webmaster Tools console that will let you test your robots.txt file to make sure
that you are properly blocking all of the pages within the directories you want to protect
with HTTPS. Naturally, this is in the "ounce of prevention" category. You'll
need to take further steps if some of your pages are already improperly indexed in the
search engines.
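If you'd like a quick local check before relying on an online tester, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the "/directory/" path, and the is_blocked helper are placeholders for illustration. Python's parser is not Google's parser, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; "/directory/" stands in for
# whatever directory you reserve for HTTPS pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(path, user_agent="Googlebot"):
    """Return True if the rules above forbid user_agent from fetching path."""
    return not parser.can_fetch(user_agent, path)


# A page inside the protected directory, and a normal page outside it.
print(is_blocked("/directory/checkout.html"))
print(is_blocked("/index.html"))
```

The first call should report the protected page as blocked and the second should report the normal page as crawlable. If either comes out the other way, the Disallow path probably needs another look.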
Use a robots <META> Tag on All Pages Using HTTPS. Using a
robots <META> tag on pages designed to be accessed with HTTPS will go a long way
toward preventing this problem. Simply add:
<meta name="robots" content="noindex,nofollow">
to the <head> section of each of these pages. This prevents the search engines
from both indexing the page where this tag resides and from following any links that
also reside on the page. If any of your pages designed for HTTPS access have already been indexed,
be sure to add this <META> tag to all such pages and then temporarily remove the
blocking instruction from your robots.txt file. This will allow the search engines to
see this <META> tag, which will cause them to remove the page from the index. Once
the pages have been removed, you can restore the blocking instruction in your robots.txt file.
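When you have more than a handful of HTTPS pages, it's easy to miss one. The following is a rough sketch, using only Python's standard html.parser module, of how you might confirm that a page's markup carries the "noindex" directive. The class name, the has_noindex helper, and the sample page are my own inventions for this example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Split "noindex,nofollow" into individual lowercase directives.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical page source for illustration.
sample_page = """<html><head>
<title>Checkout</title>
<meta name="robots" content="noindex,nofollow">
</head><body>...</body></html>"""

print(has_noindex(sample_page))
```

This only inspects markup you hand to it. Fetching each page over HTTPS and feeding the result to has_noindex is left out to keep the sketch short.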
Use the rel="canonical" Tag. The rel="canonical" tag tells the search engines the correct URL for a page. It's always a good idea to add this tag to your site's main page to prevent the common canonicalization problems with the "www." prefix, and it will also keep that page from being indexed with HTTPS. You can use this tag in many situations where a page might be accessed with different URLs, and you can also use it when a page has already been improperly indexed with HTTPS. For details on the rel="canonical" tag, see Google's article Specify Your Canonical. This is both an "ounce of prevention" and a "pound of cure" that's easy to implement and does the job pretty well in a single step. You'll find another step you can take to reinforce this setting later in this article.
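If your pages are generated by a script, the tag can be produced automatically. Here is a small sketch in Python that takes whatever URL was requested and builds a canonical <link> tag for the <head> section, always using the HTTP scheme and the "www." host name. The function name, the default host "www.example.com", and the choice to keep the query string are assumptions for this example; adjust them to your own site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(requested_url, scheme="http", host="www.example.com"):
    """Build a rel="canonical" tag that ignores the requested scheme and host."""
    parts = urlsplit(requested_url)
    # Keep the path (and query string, an assumption here), drop any fragment,
    # and force the preferred scheme and host name.
    canonical = urlunsplit((scheme, host, parts.path or "/", parts.query, ""))
    return '<link rel="canonical" href="{}">'.format(escape(canonical, quote=True))


# The same page requested over HTTPS and without the "www." prefix.
print(canonical_link_tag("https://example.com/widgets.html"))
```

Whichever variant of the URL a search engine happens to request, the tag it sees points to the single version you want indexed.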
Use Complete URLs in Your Links. Web designers like to use relative URLs when they create webpages because it simplifies testing page layouts on their computer before uploading them to the server. But, as we've seen, this can lead to search engines following improper paths through the site once they've latched on to a link that resolves to a URL starting with "https://". Get in the habit of using complete URLs and you'll be doing your site a big favor.
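To see the difference, here is a short Python sketch using the standard urllib.parse.urljoin function, which resolves links against the page they appear on. The page and link URLs are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page that a search engine fetched over HTTPS.
https_page = "https://www.example.com/secure/login.html"

# The same destination written as a relative link and as a complete URL.
relative_href = "/products/widgets.html"
absolute_href = "http://www.example.com/products/widgets.html"

# A relative link inherits the protocol of the page it sits on...
resolved_relative = urljoin(https_page, relative_href)
# ...while a complete URL keeps the protocol you wrote into it.
resolved_absolute = urljoin(https_page, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

The relative link comes out with "https://" and the complete URL stays on "http://", which is why complete URLs stop the cascade at the first page.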
Use a Special robots.txt File For HTTPS. You can serve a special robots.txt file
when your server receives a request for /robots.txt using HTTPS. If your server uses Apache server
software, you can add an instruction in your .htaccess file to handle this, such as:
# Serve robots_https.txt when /robots.txt is requested over HTTPS
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]
You may need a different instruction, depending on your server environment. If the above
example doesn't work for you, try:
# Treat any request that did not arrive on port 80 as HTTPS
RewriteCond %{SERVER_PORT} !^80$
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]
Next, create a special robots.txt file using a different file name. In my example, I use "robots_https.txt".
Modify the .htaccess code above to use whatever file name you choose, then fill it with:
User-agent: *
Disallow: /
This will block the search engines from using HTTPS for any URL on your site. If your server uses
Microsoft IIS software, contact your hosting service for advice on implementing this.
Redirect HTTPS Requests For Normal Pages. If some of your pages have already
been improperly indexed with HTTPS, it's a good idea to set up 301 redirects for those pages and
unblock them in your special robots.txt file (if any) so that the search engines can try to re-crawl
those pages and discover the new redirect. A sample .htaccess instruction for this would be:
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
or, (again) if your server uses a different HTTPS indicator field:
RewriteCond %{SERVER_PORT} !^80$
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Note that this example allows HTTPS access to one directory, but you may need to allow
access to individual pages rather than an entire directory. Once the search engines have
seen this redirect a few times, you should go ahead and restore the blocking instructions in your
special robots.txt file. Letting the search engines see the 301 redirects on URLs that have
been indexed with HTTPS will effectively remove them from the index.
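Before putting a redirect rule on a live server, it can help to write out what you expect it to do. The sketch below restates the logic of the sample rules as a plain Python function so you can list a few URLs and check the outcome you intend. It does not run or replace the Apache rules; the function name and its defaults are placeholders, and it ignores query strings, which Apache normally carries over to the redirect target on its own.

```python
def redirect_target(is_https, path,
                    allowed_prefix="/https-allowed-directory/",
                    canonical_base="http://www.example.com"):
    """Return the 301 target for a request, or None if no redirect applies."""
    if not is_https:
        # Normal HTTP requests are served as usual.
        return None
    if path.startswith(allowed_prefix):
        # Pages that are meant to use HTTPS are left alone.
        return None
    # Everything else requested over HTTPS goes back to the HTTP version.
    return canonical_base + path


# A normal page requested over HTTPS, and a page inside the allowed directory.
print(redirect_target(True, "/about.html"))
print(redirect_target(True, "/https-allowed-directory/cart.html"))
```

If the answers here match what you want, compare them against the response headers your server actually returns for the same URLs.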
In summary, removing pages that have been improperly indexed with HTTPS requires a bit of effort. The rel="canonical" tag is the easiest method for removing your normal pages that were indexed with HTTPS, but it can take a long time for the search engines to act on it. Always using the robots <META> tag set to "noindex" on pages that you never want to be indexed will go a long way toward preventing the problem as well. And serving a special robots.txt file is an added layer of prevention and will, in time, also repair the problem. The ultimate sledgehammer approach is to also install 301 redirects for directories or individual pages that have been improperly indexed.
These steps will help reduce the risk of your site developing duplicate content or canonicalization problems, and can also remove pages that have been improperly indexed with "https://" from the search engines' index. For the "pound of cure" to work, it's critical to let the search engines see the new status of the badly indexed URLs, because only then will they remove them from the index.
This SEO Tip was last updated on March 30, 2012
Need More Help?
You can also contact me with your SEO questions.
If you can't fix your website search engine problems on your own,
my Search Engine Optimization Services
can give your website what it needs to get your fair share of search engine traffic quickly, without disturbing your website's design, and without breaking your budget.
Call Richard L. Trethewey at Rainbo Design in Minneapolis today at 612-408-4057 from 9:00 AM to 5:00 PM Central time
to get started on your affordable website design package or search engine optimization program today!