Thin Content - what it is, how to find it, and what to do with it

Introduction

SEOTesting.com has a Content Quality report that takes each url in a sitemap.xml file and checks how many impressions and clicks that url has had over the last 90 days. From the results you can review the pages, decide whether any of them are thin or duplicate content, and look at resolving the issues.

Use the Content Quality report in SEOTesting.com to find out which pages get very few impressions and clicks in the search results.

You can find the Content Quality report in the Reports section of SEOTesting.com.

Once the report has run, you’ll see all the urls from your sitemap, ordered so that the ones receiving the fewest impressions in the search results come first. These urls, with zero or a very low number of impressions, are the first pieces of content you should check out.
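As a rough illustration of that ordering, the sketch below sorts a few made-up page rows by impressions and picks out the ones under a threshold. The field names, the sample urls and the threshold of 10 impressions are assumptions for the example, not the report's actual export format.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    impressions: int  # impressions over the last 90 days
    clicks: int       # clicks over the last 90 days


def lowest_impressions_first(pages):
    """Order pages so the least visible ones come first (ties broken by clicks)."""
    return sorted(pages, key=lambda p: (p.impressions, p.clicks))


def review_candidates(pages, max_impressions=10):
    """Return pages at or below the assumed impressions threshold."""
    return [p for p in lowest_impressions_first(pages) if p.impressions <= max_impressions]


# Hypothetical sample data, not real report output.
pages = [
    PageStats("https://example.com/guide", 5400, 210),
    PageStats("https://example.com/old-news-2009", 0, 0),
    PageStats("https://example.com/tag/misc", 3, 0),
]

for page in review_candidates(pages):
    print(page.url, page.impressions, page.clicks)
```

The threshold is only a starting point. A page with few impressions is a candidate for review, not automatically thin content.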

What is thin content?

Thin content can be thought of as pages on your website that have little or no value - firstly to users, and ultimately to Google. There are a number of reasons a site ends up with thin content. If your site has been running for a number of years without any kind of content audit, it is likely you’ll find some content that fits into this category.

Thin content doesn’t simply mean a lack of words on a page, but as that’s an easy definition, let’s start there:

Lack of detail

Back in the day, if you had a website with a strong standing in Google’s eyes, you could create lots of pages that each targeted an individual keyword and rank relatively well for them. Sites sprang up with hundreds of thousands of pages, each with little more than a paragraph of text. Some of this content was handwritten by low-paid employees, some of it was auto-generated. Either way, it filled the search results with content that was ultimately garbage and left users frustrated.

Google Panda was Google’s attempt to clean up the search results and remove low quality content. Initially it impacted an entire site, which meant if your site had thin content pages, potentially your whole site would be demoted in the search results.

Google Panda is now reported to work on a page by page basis, and in real time as pages are crawled and indexed, rather than big periodic updates.

The first version of Panda may have looked simply at word count as an indicator of thin content, but with each release the algorithms have got better at understanding what type of content users expect for individual queries. Sometimes a short answer is what you are hoping to find in the search results! This is why you need to check the search results for the query you are targeting.

Out of date

Remember that blog post you published on your site back in 2009 about something related to a minor news story?

No - you probably don’t, and if you don’t remember it, other users are unlikely to find it useful!

Reviewing old content on a blog is a great way of finding out of date content that is no longer relevant, useful or interesting, and potentially all of the above.

If users aren’t going to find it useful or interesting, there’s very little chance that Google will either.

Auto generated pages

Pages that are auto generated by machines rather than written by humans often fall into the thin content category. This can be because they are duplicated from other sources (via an API or scraping). As machine learning gets better at generating text and content from seed keywords, more and more sites are going to make use of generated text. But as the usage increases, expect Google’s ability to algorithmically detect auto generated text to improve as well.

Landing pages

If you have been doing any paid advertising on a site, there could be a bunch of landing pages that are very similar in content and design, with only small changes to the page title or h1 to target an individual paid keyword.

It is worth making sure these landing pages are kept out of Google’s search index as they can easily lead to duplicate content issues and keyword cannibalization.
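One common way to keep a page out of the index is a robots meta tag containing noindex. The sketch below, using only the Python standard library, checks whether a page's HTML carries that directive. The sample HTML strings are made up, and the check is an assumption-laden simplification: it does not look at the X-Robots-Tag HTTP header or at robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() in ("robots", "googlebot"):
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical pages for illustration.
landing_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
blog_post = "<html><head><title>Guide</title></head></html>"

print(has_noindex(landing_page))
print(has_noindex(blog_post))
```

Running a check like this over your paid landing page urls is a quick way to spot any that could still be indexed.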

Affiliate pages

If your site generates its revenue from affiliate marketing (where you get a small percentage of the sale if a user’s click is generated from your site), you need to be very careful that your pages do not simply exist to try and generate that click. Make sure your pages add value for the user beyond just trying to get them to click on something.

Accidental duplicate content

Duplicate content is another issue that Panda was designed to solve. While it is not a thin content problem specifically, the Content Quality report can help you diagnose duplicate content issues, because investigating pages with 0 impressions and clicks in the search results often uncovers them.

Accidental duplicate content can happen for a number of reasons but here are a few key things to consider:

  • Has a migration to https recently been performed?
  • Does your site canonicalise to a particular domain format, e.g. www.domain.com vs domain.com?
  • Has a staging or development server accidentally been made public and accessible by Googlebot?
  • Does your company have different sites set up for different geographical locations? If so, has hreflang been used correctly?
  • Search functionality on your website or faceted navigation can lead to massive duplicate content issues if not dealt with correctly.
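Several of the checks above come down to the same page being reachable at more than one url. As a minimal sketch, the code below groups urls that differ only by protocol, a leading www., a trailing slash or a query string. The normalisation rules and sample urls are assumptions for the example. Real sites may legitimately serve different content on query strings, and a staging hostname would need its own rule.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a url to host + path, ignoring protocol, www., trailing slash and query."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path


def duplicate_groups(urls):
    """Return only the groups where more than one url maps to the same normalised key."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


# Hypothetical urls for illustration.
urls = [
    "http://domain.com/pricing",
    "https://www.domain.com/pricing/",
    "https://www.domain.com/pricing?sort=asc",
    "https://www.domain.com/about",
]

for key, found in duplicate_groups(urls).items():
    print(key, found)
```

Any group this produces is worth checking for a missing redirect or canonical tag.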

Why is thin content a problem for your website?

You want Google to love your website!

Ultimately, if Google, through its algorithms or via manual content raters, thinks your pages are of low value, you are unlikely to get a page to rank well.

If a majority of the pages on your site are judged to be low quality, you cannot discount a site-wide penalty, which is always hard to recover from.

There are also specific technical reasons related to Google crawling your site. Googlebot, which is the part of Google that visits your site to discover pages and content, only spends so long on your site each day. If your site is judged to be low quality, Googlebot will spend less time crawling it, meaning that updates to frequently changed pages may take longer to be reflected in the search results.

What to do about thin content?

You have three main options when it comes to clearing up a site with lots of thin content. Once you have discovered the pages that are considered thin content you can do one of the following:

Delete it

Simply remove it from the site and let the url return a 404 (not found) or 410 (gone) error code.

Once you delete a lot of content from a site, it is advisable to run an internal crawl using a tool such as Screaming Frog or Sitebulb, which will highlight any internal links that have been broken by the deleted pages.
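To show the idea behind that check, here is a small sketch that takes a map of pages to the internal links they contain and reports every link that points at a deleted url. The link map and url paths are made up for the example. In practice the map would come from a crawler export, and this is not how any particular crawling tool works internally.

```python
def broken_internal_links(link_graph, removed_urls):
    """Return (source, target) pairs where a remaining page links to a removed url."""
    removed = set(removed_urls)
    broken = []
    for source, targets in link_graph.items():
        if source in removed:
            continue  # the linking page itself is gone, so nothing to fix there
        for target in targets:
            if target in removed:
                broken.append((source, target))
    return sorted(broken)


# Hypothetical internal link map: page -> pages it links to.
link_graph = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/old-news-2009", "/blog/seo-guide"],
    "/blog/seo-guide": ["/blog/old-news-2009", "/pricing"],
    "/blog/old-news-2009": ["/"],
}

for source, target in broken_internal_links(link_graph, ["/blog/old-news-2009"]):
    print(f"{source} links to deleted page {target}")
```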

Improve it

Make the content better. Make sure the content reflects the search query and what Google thinks the user is expecting to see. Also make sure it is as good as, if not better than, what is currently being returned in the search results.

Redirect it

If the url has good quality external links pointing to it, you could redirect it to another relevant page. Also, if you are taking a number of thin content pages and rolling them up into a single article, redirecting the old individual urls to the new one would make sense.
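When several rounds of consolidation pile up, old urls can end up redirecting to pages that themselves redirect. As one possible way of keeping a redirect list tidy, the sketch below flattens such chains so that every old url points straight at its final destination. The paths are invented for the example, and the mapping format is an assumption rather than any particular server's configuration syntax.

```python
def flatten_redirects(redirects):
    """Map each old url directly to its final destination, following any chains."""
    flattened = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[start] = target
    return flattened


# Hypothetical roll-up: two thin pages were merged into /widgets,
# which was later merged into /widget-guide.
redirects = {
    "/widgets-red": "/widgets",
    "/widgets-blue": "/widgets",
    "/widgets": "/widget-guide",
}

for old, new in flatten_redirects(redirects).items():
    print(old, "->", new)
```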

Summary

Checking your website for thin content and content quality issues should be part of a regular audit of your site, for example once a quarter. The Content Quality report in SEOTesting.com can quickly identify the pages to look at.