Let's start with the canonicalization issue that is almost exclusively limited
to Google. By the strictest definitions, the two URLs "http://yoursite.com"
and "http://www.yoursite.com" are separate and distinct entities. The first technically
points to the domain's root directory, and the second points to a subdomain named
"www". In the earliest days of the World Wide Web, the contents of a website were
actually stored in a directory named with the standard abbreviation "www" by convention. Thus, the common
practice of making a website's URL begin with that prefix was born. But as the Internet
became more popular, and webmasters and IT managers made allowances for
less technically savvy users, various shorthand methods crept into usage. The
one we deal with here is making the www prefix optional. I'm sure it seemed a
natural thing to do. When referring to a website by its URL, the "www" part is frequently
omitted both in speech and in writing, so it was only logical that users would
take the same shortcut when they went online. So, rather than frustrate those users needlessly,
servers were configured to allow either version to retrieve the same content. Users
were happy, IT managers were happy, and webmasters were happy. But being the product
of computer-based logic, search engine algorithms often fail to understand when they
should treat these two URLs as one and the same. Google has remained particularly stubborn
about this issue, despite overwhelming evidence of the problems it causes. They even
have a page in their support section that deals with it. Google now provides a method for webmasters
to select a Preferred Domain in Google
Webmaster Tools. But this tool works only for Google, and you should still install the 301
redirect if at all possible. If you can't install a 301 redirect, there are other solutions,
such as a <meta> refresh tag and the new rel="canonical" tag.
The problem is two-fold. First, there is the issue of link popularity.
Google's vaunted PageRank system depends on links, and it will not always canonicalize
(i.e., treat as identical) links that omit the www and links that include it. This
often means lower rankings for the site for most searches than it actually deserves. Second,
and a frequent result of the first, Google won't deep crawl one version of the URL
or the other based on either (a) the reduced link popularity/PageRank, or (b) duplicate
content issues. Having the same content available from more than one URL is a violation
of the guidelines of all major search engines and this www issue is one of the most common causes of
canonicalization problems in Google. Fewer pages indexed for a site means that,
once again, one version of the URL is not receiving proper link popularity credit
for its own internal links.
So the problem compounds itself over time, and can be
especially debilitating to sites that weren't all that strong to begin with. Sadly,
webmasters are often partially responsible for this problem because, knowing they
can "get away with it", they will use the shorthand version when submitting
their site to directories or posting links on webpages of their own design. Once
this genie is out of the bottle, it's a long battle to win because even if you
are able to find every incorrect link on your own site, all it takes is a malformed
link on an obscure page that doesn't show up in Google's "link:" command
to keep this demon haunting you forever. Fortunately, there is a solution.
The solution is to use server control methods to automatically
redirect requests to the proper URL. The server must return a "301 Moved
Permanently" result code in order for the search engines to properly assign
the link popularity and to update their internal records of the page's true URL.
Websites running on hosts that use the Apache server software
usually have it the easiest in this regard because they can control this problem
on their own using the .htaccess control file. Just create a simple
text file named ".htaccess" with no filename extension in your site's root directory, and insert the
following directives:
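# Requires mod_rewrite. Sends the bare domain to the www host with a 301.
# Keep the backslash before the dot in the RewriteCond pattern.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]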
Simply replace "yoursite.com" in the above code with
your website's domain name. Webmasters whose sites run on Microsoft's IIS server software will
need to consult their system administrator for help. Again, be sure the server
returns the 301 status code or you're only papering over the problem
and not repairing it. A 302 result is not acceptable because 302 means
"Moved Temporarily" and doesn't repair canonicalization problems.
You can check the code your site returns with my Server Header Checker.
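If you would rather run the check yourself, it can be sketched in a few lines of Python. This is only a minimal sketch using the standard library, not the Server Header Checker mentioned above; the function names fetch_status and check_redirect are made up for illustration, and "www.yoursite.com" is a placeholder for your own preferred host.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(host, path="/"):
    """Send one HEAD request without following redirects.

    Returns (status_code, location_header_or_None).
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def check_redirect(status, location, canonical_host):
    """Describe whether a response repairs the www canonicalization problem."""
    if status == 301:
        # hostname is None for a missing or relative Location header
        target = urlsplit(location or "").hostname
        if target == canonical_host.lower():
            return "ok: 301 to canonical host"
        return "301, but not to the canonical host"
    if status == 302:
        return "302 is temporary and does not fix canonicalization"
    return "no redirect (status %d)" % status
```

Assuming your preferred host is the www version, you would call check_redirect(*fetch_status("yoursite.com"), "www.yoursite.com") and expect the "ok" message once the redirect is installed correctly.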
Canonicalization Problems with Session IDs & Dynamic URLs
The www issue is only one place where canonicalization problems occur. Any time
the search engines encounter a page that is essentially identical to another page, they will try
to select the best, or "canonical", version and filter any duplicates from their index.
As with the www issue, this can hurt your site's performance in the search engines. The proliferation
of blogs and other content management systems has brought canonicalization problems to many websites
because those programs routinely create multiple URLs that point to the same content.
The search engines are becoming more adept at detecting and dealing with
the most common canonicalization problems in blogs and forums, but it's up to the individual webmaster to take
steps to prevent the problem from arising in the first place. Fortunately, most blog platforms are supported
by a community of talented programmers who have created add-ons that can reduce the number
of canonicalization problems.
Ecommerce websites have their own problems with canonicalization. Many shopping
cart programs require users to accept cookies in their browser or they will add what are called
"Session IDs" to every link. Since search engine crawlers don't accept cookies, they have
traditionally avoided crawling any URL that included a Session ID or other user identification
value. Ecommerce sites can also create canonicalization issues when they offer
features such as sorting lists of products by price, color, or size. The search engines see these
pages as containing nearly identical content and suppress them. Fortunately, two of the major
search engines - Google and Yahoo! - now provide tools for webmasters to manage these problems
involving dynamic URLs. Naturally, you need to register and verify your site in order to use
these tools. Assuming you've already done so, here's how they work:
In Google's Webmaster Tools console, you can tell Google to ignore parameters
in query strings, such as session IDs. Click on "Site Configuration", then "Settings", and
you'll see a section titled "Parameter Handling". Click on "Adjust parameter settings".
You'll see a text box labeled "parameter name". Enter the name your site gives to the parameter
for your session ID (for example, osCommerce uses "osCsid"). Then choose "Ignore"
from the drop-down menu titled "Action". Soon, Google will filter out that parameter
from the URLs for your site, and will start to properly index any URLs that would have caused a
problem in the past.
Yahoo! Site Explorer has a similar
tool. Select your site from the "My Sites" list. Then click on "Actions", followed
by "Dynamic URLs" in the menu on the left. Enter
the parameter name in the appropriate text box, and choose "Remove From URLs". This will
have the same effect as the Google tool. It will begin to filter the named parameter from the URLs
it encounters for your domain.
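To picture what "ignoring" a parameter does, here is a minimal Python sketch of the same idea. It is an illustration only, not how Google or Yahoo! implement their tools; the function name strip_params and the example URLs are hypothetical, and the parameter name is matched without regard to case.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_params(url, ignored=("oscsid",)):
    """Return url with the named query parameters removed (case-insensitive)."""
    ignored_lower = {name.lower() for name in ignored}
    parts = urlsplit(url)
    # Keep every query parameter whose name is not in the ignore list
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored_lower
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two URLs that differ only in their session ID, such as "product.php?id=7&oscSid=abc123" and "product.php?id=7&oscSid=xyz789", both reduce to "product.php?id=7", which is the single version you want the search engines to index.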
Repairing Canonicalization Without Redirects
Many webmasters don't have access to server redirect tools like Apache's
.htaccess file, so they can't install conventional redirects to solve canonicalization problems.
Fortunately, there is a simple alternative.
In February 2009, the major search engines gave all webmasters a very powerful and easy-to-use
method of preventing and repairing canonicalization problems. The four largest search engines
(Google, Yahoo!, MSN, and Ask.com) have all agreed to support a new "canonical" value for the rel attribute of
the <link> tag that goes in the <head> section of your HTML documents. The syntax is
as follows:
<link rel="canonical" href="http://www.yoursite.com/" />
This tag will be used as "a very strong hint" in determining the canonical version of
a URL within a single domain. For such purposes, it is treated much like a 301 redirect. For more information, see
the Google Webmaster Blog post "Specify Your Canonical" and Matt Cutts' article
"Learn About The Canonical Link Element in 5 Minutes". Both are well worth reading, but Matt Cutts in particular
explains the impact on rankings and offers ideas for when it's appropriate to take action.
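Once the tag is in place, it is worth confirming that each page declares the URL you intended. The following is a minimal Python sketch of such a check using only the standard library; the class name CanonicalFinder and the function find_canonical are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_canonical(html):
    """Return the list of canonical URLs declared in the given HTML text."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.hrefs
```

A page should normally yield exactly one URL. An empty list means the tag is missing, and more than one entry suggests a template is emitting conflicting tags.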
Site owners who operate multiple websites for a single company or organization
face the issue of the best way to deal with duplicate content on pages that contain information like
contact details or terms of use that are common to all of their websites. Overall, there is no reason
to worry about duplicate content for these pages since they are rarely pages that need to rank well.
You should simply provide a clear navigation path for users who are looking for such information in
the normal design of your website, and let your other pages carry the burden for ranking issues.