What is duplicate content and how to deal with it on your site

Written by Merche Martínez

One of the most common indexability problems I run into when I carry out an SEO audit of a website is duplicate content. The goal of this post is to provide a guide to the key points worth reviewing, based on the issues we have detected at Human Level over many years of working as SEO consultants.

What is duplicate content?

Search engines index each piece of content under a unique identifier: the URL of the page.

It’s absolutely essential that there is an unambiguous, one-to-one correspondence between a piece of content and a URL.

Presenting the same content on two different URLs can be detected as duplicate content by search engines, which may consider it an attempt to occupy more positions in the search results. For that reason, they usually choose one URL (frequently the oldest or the most popular one) as the “original” source of the content and give it a better ranking, while the other URLs with the same content are pushed down to the lowest positions. Here is an example of duplicate content:

www.example.com/test

www.example.com/test-2  > same content
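To make the idea of "same content, different URL" concrete, here is a minimal sketch that groups URLs by a hash of their text. It is only an illustration for checking your own pages, not a description of how search engines detect duplicates; the function names and the sample page texts are assumptions made for this example.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and ignoring case."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_urls(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose page text has the same fingerprint."""
    groups: dict[str, list[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


# Hypothetical page texts, keyed by URL.
pages = {
    "www.example.com/test": "Same   content here.",
    "www.example.com/test-2": "same content here.",
    "www.example.com/other": "Something different.",
}
print(find_duplicate_urls(pages))
```

An exact hash only catches identical text; pages that are merely "very similar" would need a fuzzier comparison.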

Reviewing Google search results

Duplicate content detected by Google

Google provides us with an important indicator that something is amiss. Enter the site: operator followed by the domain, like so:

site:www.example.com

This query returns the pages of the domain that are currently indexed in Google, but what we want to see here is the last page of results. Here is a small tip to reach it: look at the URL of the results page and locate the “start” parameter. With 10 results per page, the value is 10 on page 2, 20 on page 3, and so on. If we change the parameter’s value to 990 (start=990), Google will take us to the last page available.

When we have duplicate content, we will see a message like the following, saying that a number of pages (298 in this example) have been omitted from the results because Google considers them to be “very similar” to the ones already displayed.
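The arithmetic behind the “start” parameter can be sketched in a few lines. This assumes 10 results per page, as in the example; the function name and the https://www.google.com/search base URL are illustrative assumptions.

```python
from urllib.parse import urlencode


def results_page_url(query: str, page: int, per_page: int = 10) -> str:
    """Build a results URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    # Page 1 starts at 0, page 2 at 10, page 3 at 20, and so on.
    start = (page - 1) * per_page
    return "https://www.google.com/search?" + urlencode({"q": query, "start": start})


print(results_page_url("site:www.example.com", 2))    # start=10
print(results_page_url("site:www.example.com", 100))  # start=990
```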

Omitted results on Google

If we click on the link “repeat the search with the omitted results included”, we will see which pages are being considered as duplicate content. Unfortunately, they aren’t highlighted with a flashy background or anything of the sort. If only! There is no other way but to review these search results as they are displayed.

Detecting duplicate content by reviewing the search results

We can also detect duplicate content by simply looking through the pages displayed in the search results, and their titles and snippets.

The title is one of the most important elements used to calculate on-page relevance. If we look through the search results, we may run into several identical titles, which can be an indicator of duplicate content.

If we detect several identical titles, we can further analyse them with an additional search operator, intitle:

site:www.example.com intitle:"title text"

The same thing happens with the snippet, the small fragment of text that briefly describes the page content in the search engine results pages (SERPs). By default, this text usually matches the description meta tag of the page. It is important to check this point thoroughly, because if we see duplicate snippets, we are also likely to find duplicate content.
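If we already have a list of our pages with their titles and descriptions (for example, from a crawl export), the manual check above can be approximated with a short script. This is a sketch under that assumption; the field names `url`, `title` and `description` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_field(pages: list[dict[str, str]], field: str) -> dict[str, list[str]]:
    """Map each repeated value of `field` to the URLs that share it."""
    groups = defaultdict(list)
    for page in pages:
        # Compare values case-insensitively and ignore surrounding spaces.
        value = page.get(field, "").strip().lower()
        if value:
            groups[value].append(page["url"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


# Hypothetical crawl data.
crawl = [
    {"url": "/rent/alicante/", "title": "Apartments for rent", "description": "Rent in Alicante"},
    {"url": "/alicante/rent/", "title": "Apartments for Rent", "description": "Rent in Alicante"},
    {"url": "/contact/", "title": "Contact", "description": "Get in touch"},
]
print(group_by_field(crawl, "title"))
print(group_by_field(crawl, "description"))
```

Identical titles or descriptions are only a hint; the pages they point to still need to be reviewed by hand.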

“0 results” product category pages

While browsing our site, we may encounter product categories that return no results. If these pages show messages like “0 results” or “No results found”, we can enter the following query into Google’s search bar:

site:www.example.com "No results found"

This will return all indexed pages of empty product categories. If these pages don’t have any content to differentiate them from others, they count as duplicate content.

Tools to detect duplicate content

Google Search Console

Google Search Console provides us with the “HTML improvements” section. It lists titles and descriptions that Google considers duplicate.

Duplicate titles and descriptions in GSC HTML improvements

If we click on “Duplicate titles”, we will see a list of the affected titles, and if we open one of them, we will see the exact pages where that duplicate title has been found. The same information is available for descriptions.

Screaming Frog

Screaming Frog is a very useful SEO tool. Amongst its many features, it allows us to detect duplicate content. It is a paid tool, but it has a free version limited to 500 URLs per crawl. The free version can also flag duplicate content, and that’s what we’re here for.

First, we need to crawl the existing URLs of our site. To do this, we enter our domain:

Screaming Frog enter domain

Once the crawl has finished, we need to select the “Duplicate” filter on the URI tab. We will see a list of pages with different URLs, but duplicate elements like title, description, H1, etc.

Duplicate filter Screaming Frog

Mirror domains

Duplicate content can also occur between domains. When two or more domains serve exactly the same content, they are called mirror domains. The most common example is when the main domain (example.com) and the www subdomain (www.example.com) serve identical content. With the query site:example.com -inurl:www we can see the indexed pages that are not on the www subdomain, if there are any.

Sometimes we can also find versions of sites in pre-production or in development, which have been published and indexed. These also generate duplicate content. For example: dev.example.com, pre.example.com, or even an entirely different domain.
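One common way to resolve the example.com / www.example.com mirror is to permanently redirect every request to a single host. The sketch below only computes the redirect target; which host is the canonical one, and the names `CANONICAL_HOST` and `MIRROR_HOSTS`, are assumptions for this example.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumption: the host we want indexed
MIRROR_HOSTS = {"example.com"}      # assumption: hosts serving the same content


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.hostname in MIRROR_HOSTS:
        # Keep scheme, path and query; only the host changes.
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
        )
    return None


print(redirect_target("https://example.com/test?page=2"))
print(redirect_target("https://www.example.com/test"))
```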

Pagination

Pagination is also a frequent source of duplicate content. Let’s explore some instances we can find with pagination:

  • First page: the content of the first page can often be displayed both with and without the page=1 parameter.

www.example.com/list

www.example.com/list?page=1

  • Last page: I’ve seen cases where the pagination value isn’t correctly configured for the last page, and it returns the same result for any higher value.

www.example.com/list?page=4  > last page

www.example.com/list?page=10

www.example.com/list?page=100
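Both cases can be handled with a small check on the page parameter. The following is a sketch of one possible policy, not the only valid one: redirect page=1 to the URL without the parameter, and answer 404 for anything outside the real range of pages. The function name and the return shape are assumptions for this example.

```python
import re
from typing import Optional, Tuple


def resolve_list_page(
    base_url: str, raw_page: Optional[str], last_page: int
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect location) for a paginated list request."""
    if raw_page is None:
        return 200, None       # /list is the first page
    if not re.fullmatch(r"[0-9]+", raw_page):
        return 404, None       # page=abc, page=-1, ...
    page = int(raw_page)
    if page == 1:
        return 301, base_url   # /list?page=1 duplicates /list
    if 2 <= page <= last_page:
        return 200, None
    return 404, None           # page=0 or anything past the last page


print(resolve_list_page("/list", "1", last_page=4))
print(resolve_list_page("/list", "4", last_page=4))
print(resolve_list_page("/list", "100", last_page=4))
```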

On list pages we have also detected a common error: every page of the list has an identical title and description. This can be easily detected by manually reviewing the search results, or with the help of Google Search Console’s “HTML improvements” section.

Navigation paths

It’s also possible to reach the same content by following various navigation paths. It’s perfectly logical that some products, services or lists belong to several different categories. Here’s an example to better explain what I mean: to rent an apartment in Alicante, we could have the following navigation paths:

www.example.com > www.example.com/rent/ > www.example.com/rent/alicante/

or

www.example.com > www.example.com/alicante/ > www.example.com/alicante/rent/

In this case, both www.example.com/rent/alicante/ and www.example.com/alicante/rent/ would return the same content, thus resulting in duplicate content.

Parameter index

Earlier, we touched on this issue when I talked about pagination specifically. But really, it can happen with any parameter.

When we have a URL with parameters, it is possible that we can set a parameter to any value and still get exactly the same result. For example, if we want to sort a list, we have two options: order=ascending and order=descending. However, if the website hasn’t been correctly programmed, we could also type something like “order=blablabla” and it would still return one of the aforementioned orderings. This would also count as duplicate content.
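A simple defence is to accept only the values the site actually supports and reject everything else. Here is a minimal sketch, assuming the parameter is called order as in the example; answering 404 for unknown values is one possible choice, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

ALLOWED_ORDERS = {"ascending", "descending"}


def order_status(url: str) -> int:
    """Return 200 if the order parameter is absent or valid, 404 otherwise."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    values = query.get("order", [])
    if not values:
        return 200  # no order parameter: default ordering
    # Reject repeated parameters and any value we do not support.
    if len(values) == 1 and values[0] in ALLOWED_ORDERS:
        return 200
    return 404


print(order_status("https://www.example.com/list?order=ascending"))
print(order_status("https://www.example.com/list?order=blablabla"))
```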

Duplicate content of the home page

The home page can easily be displayed on different URLs of the same site. Some of the most frequent examples include:

www.example.com

www.example.com/home.php

www.example.com/index.html

These pages could be linked from the main menu (start page, home page, etc), from the logo, from the footer, or any other page. And yep! You guessed it. This is duplicate content too.
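If we decide that the root URL is the only valid address for the home page, the alternative paths can be redirected to it. A minimal sketch, assuming the two alias paths from the example above; the names `HOME_ALIASES` and `home_redirect` are hypothetical.

```python
from typing import Optional

# Assumption: these paths serve the same content as the root URL.
HOME_ALIASES = {"/home.php", "/index.html"}


def home_redirect(path: str) -> Optional[str]:
    """Return "/" when the path is a home page alias, otherwise None."""
    return "/" if path in HOME_ALIASES else None


print(home_redirect("/home.php"))
print(home_redirect("/test"))
```

It also helps to link the menu, the logo and the footer directly to the root URL, so the aliases stop being discovered in the first place.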

Tags

Poorly managed tags can also generate tons of pages with identical content. For example, a common occurrence in blogs is to create new tags for each post, tags that will probably never be assigned to any other article. Each of these tags generates a new page that lists only one post. These pages are usually considered duplicate content.
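Tags that are used by a single post are easy to find if we can export the posts with their tags. The sketch below assumes such an export as a dictionary of post URL to tag list; the data and the function name are hypothetical.

```python
from collections import Counter


def single_post_tags(posts: dict[str, list[str]]) -> list[str]:
    """Return the tags that are assigned to exactly one post."""
    # set(tags) avoids counting a tag twice if a post repeats it.
    counts = Counter(tag for tags in posts.values() for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n == 1)


# Hypothetical blog export.
posts = {
    "/blog/duplicate-content": ["seo", "duplicate-content", "audit"],
    "/blog/canonical-tags": ["seo", "canonical"],
    "/blog/site-audit": ["seo", "audit"],
}
print(single_post_tags(posts))
```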

External links

In some of the instances I’ve mentioned, such as “order=blablabla”, you probably thought: how is this URL even going to get indexed if no link points to it from anywhere on my site? It’s actually much easier than you would think: it could be linked by mistake from another website, and if the page returns a 200 OK status, its content can get indexed.

Tips to prevent duplicate content

So… what should you do if you have duplicate content? I’m sorry to say there isn’t one single formula to defeat it, but here are a few possible solutions:

  • Canonical: in some cases, the recommendation is to include the correct canonical link element. This way, we can suggest to Google which page we want indexed.
  • 301 redirect: in some cases, it is recommended to permanently redirect (301) a URL containing duplicate content to the URL we consider to be the correct one.
  • 404 error: in some cases, the page with duplicate content probably shouldn’t exist. In that case it should return a 404 error status.
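As an illustration of the first option, the canonical link element is a single tag placed in the head of the duplicate page, pointing at the URL we want indexed. The helper below is a minimal sketch that renders it; the function name is hypothetical, and how it gets into your templates depends on your CMS.

```python
from html import escape


def canonical_link(url: str) -> str:
    """Render the canonical link element for the given preferred URL."""
    # escape() protects characters such as & and " inside the attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


# /alicante/rent/ could declare /rent/alicante/ as its canonical version.
print(canonical_link("https://www.example.com/rent/alicante/"))
```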

As you can see, duplicate content is a frequent error many websites experience in their lifetime, but it can be detected fairly easily, and solved using the recommendations I provide in this post. Now that you have read it, it would be a good idea for you to go and review your own site, to ensure that duplicate content isn’t an issue affecting you.

Author: Merche Martínez
SEO consultant at the Human Level online marketing agency. She's an expert in search engine optimization at both national and international levels. She's also a certified Google AdWords user.
