What are the data sources for keyword research tools?

Written by Tiago Silva. Updated on 23 August 2022.

Ever wondered where your favorite keyword research tools get their keyword ideas and estimated search volume numbers from?

In this article, we take you through the different data sources these tools use:

  1. Scraping the SERPs
  2. Scraping web pages
  3. Google Search Console
  4. Google Ads Keyword Planner
  5. Clickstream data
  6. Social Media & Online Forums
  7. Google Trends

We also cover why you need to be careful with the free tools, plugins, and browser extensions you grant access to your site’s Google Search Console data.

1. Scraping the SERPs (Search Engine Results Pages)

Most keyword research tools scrape Google’s Search Engine Results Pages (SERPs) to fill their databases with popular keywords. Scraping the SERPs means using bots that query Google automatically and extract the information the tool needs from the results.

These keyword tools can use the Google Search API or third-party services like ScrapingBee or Outscraper.

SEO tools scrape keyword ideas from:

  • Autocomplete.
  • Related searches.
  • People Also Ask (PAA).
  • People Also Search For (PASF).

Autocomplete Keywords (Suggestions)

Google shows suggested queries when a user starts typing in the search bar.

This is known as Google Autocomplete.

     Google Autocomplete suggestions are a popular keyword source.

The most common way to get a list of keywords from Autocomplete is the alphabet soup technique. This is nothing more than taking the seed keyword and appending each letter of the alphabet in turn to get suggestions from Google.

To build a big list of keyword ideas, tools also check for questions, prepositions, and comparisons by using words that modify the search intent.
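As a rough illustration of the alphabet soup technique and the modifier words described above, here is a minimal Python sketch. It only builds the partial queries a tool would type into the search bar; it does not contact Google. The function name and the modifier lists are assumptions made for this example, and real tools most likely use far larger lists.

```python
import string

# Hypothetical modifier lists, kept short for illustration only.
QUESTION_WORDS = ["how", "what", "why"]
PREPOSITIONS = ["for", "with", "without"]
COMPARISONS = ["vs", "or", "like"]


def alphabet_soup(seed: str) -> list[str]:
    """Return the partial queries to feed into Autocomplete for a seed keyword."""
    seed = seed.strip().lower()
    # Seed keyword followed by each letter of the alphabet: "seo a", "seo b", ...
    queries = [f"{seed} {letter}" for letter in string.ascii_lowercase]
    # Question words go in front of the seed: "how seo", "what seo", ...
    queries += [f"{word} {seed}" for word in QUESTION_WORDS]
    # Prepositions and comparisons go after the seed: "seo for", "seo vs", ...
    queries += [f"{seed} {word}" for word in PREPOSITIONS + COMPARISONS]
    return queries


if __name__ == "__main__":
    for query in alphabet_soup("keyword research")[:5]:
        print(query)
```

Each generated string would then be sent to an autocomplete source, and the suggestions that come back become the keyword ideas.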

Tools that scrape Google Autocomplete will give you many keyword ideas, and the best part is they should have some search volume because Google is pushing them to users.

The downside is that everyone can go after the same keywords because Google recommends/suggests them. Ultimately this means more competition, a higher CPC (Cost Per Click), and more backlinks needed to rank.

Related Searches

At the very bottom of the SERP, Google displays 'related searches' for the current search term.

This is how they show alternative searches to users reaching the bottom of the page without clicking on any results.

Google Related Searches has some overlap with suggestions from Autocomplete.

Even so, this scraping method gives you alternatives to the current query as well as more specific searches.

Unfortunately, Google only outputs around eight keywords with this feature. That's a low volume of related searches per query if you ask me.

     Google related searches at the end of the SERP.

People Also Ask (PAA)

The People Also Ask section of the SERPs displays questions that are related to the user's query.

     The People Also Ask box shows questions related to the query.

When someone clicks a question, it expands to show a featured snippet that answers the PAA query. The box also grows to show more questions related to the one you clicked on.

Keyword research tools scrape these PAA sections as they are a great source of keywords and content ideas.

PAA boxes have become increasingly popular in keyword research, mainly because they surface long tail keywords and questions that you can quickly add to an existing article.

While a PAA answer displays a snippet from only a single page, each PAA suggestion is a search query in itself, one that users type directly into Google and see SERP results for.

People Also Search For (PASF)

People Also Search For is a Google feature that only shows up after you click on a result and then return to the SERP by clicking the browser's back button.

     The People Also Search For box appears after a user returns to the SERP.

This way, Google shows alternative searches right below the result someone just clicked.

For a user, this offers the option to refine the search without having to scroll further.

For SEO tools, this is an opportunity to see what keywords Google thinks are related to the current query.

2. Scraping The Pages That Rank For A Query

Some tools will analyze the individual page results for a particular query and find other keywords a page ranks for.

They analyze the content of these pages to find commonly used words and entities.

This is effectively an attempt to reverse engineer what Google does when it crawls and analyzes a page.

Google's NLP (Natural Language Processing) is one option tools might use to identify the entities on a page and how they relate to each keyword.

Some tools might focus on simpler factors like keyword frequency, similar keywords, and content overlap between the top pages.

The advantage of this method is that the algorithms can be trained on thousands of SERPs, so over time they learn which entities are the most important and most frequently mentioned in the top results.
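To make the "simpler factors" concrete, the sketch below counts how often each term appears across a handful of page texts and keeps only the terms shared by several pages. It is a minimal example under stated assumptions: the function names, the tiny stopword list, and the sample page texts are all made up for illustration, and it says nothing about how any particular tool or Google itself works.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not a standard set.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def shared_terms(pages: list[str], min_pages: int = 2) -> list[tuple[str, int]]:
    """Return (term, total frequency) for terms found on at least min_pages pages."""
    total = Counter()       # how many times each term occurs overall
    page_count = Counter()  # on how many pages each term occurs
    for text in pages:
        tokens = tokenize(text)
        total.update(tokens)
        page_count.update(set(tokens))
    shared = [(term, total[term]) for term in total if page_count[term] >= min_pages]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical text from three top-ranking pages.
    pages = [
        "Keyword research tools for SEO",
        "The best SEO tools and keyword ideas",
        "Free keyword planner",
    ]
    print(shared_terms(pages))
```

Terms that show up on most of the top pages are candidates for related keywords or entities worth covering.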

3. Google Search Console Data

Google Search Console is one of the most important data sources a website owner has at their disposal. So much so that granting access to GSC data is like giving a spare key to your house's front door!

Some services with Google Search Console access will use this data to do what they call "improving the service" - in plain English, getting more accurate keyword volumes based on real data.

To improve keyword volumes, SEO tools can draw on GSC data from the sites that have granted them access. This can happen with a free or paid tool.

Having a tool trained with Google Search Console data is invaluable. It can give a competitive advantage to the developers and customers of said tool.

To know if a tool sucks up your Google Search Console data, check its terms of service and privacy policy.

Note: SEOTesting does not use your Google Search Console data for any other purpose than to display it to you. We do not share or sell it to third parties.

Be Wary of Free Tools

Every service has a cost to run and maintain. But some companies have free services with the goal of mining user data.

This can happen when free keyword research tools or WordPress plugins have Google Search Console access.

An example could be a free plugin that displays data from Google Search Console inside the WordPress dashboard.

These tools tempt users because they are convenient. Then, once they have access to the data, they might use, share, or sell it to build keyword estimations with real-world data.

Read the terms and conditions of every service that asks for access to Google Search Console to know if they use your data to improve their tools. Especially when the service is free.

4. Google Ads Keyword Planner (GKP)

Google Ads Keyword Planner is a tool for advertisers to create paid media campaigns.

At the time of release, GKP was extremely useful for SEOs to get keyword suggestions, search volume, and ideas.

GKP is seen as a reliable source because it's an official Google product, showing keyword volumes based on the last 12 months.

The ability to get keyword volume by country also helps explain why Keyword Planner became so popular.

GKP is still one of the most used keyword sources, especially amongst free SEO tools.

Google Ads Keyword Planner is free to access, but Google doesn't want people to use the data for SEO. This is why they have stopped showing exact keyword volumes for low-spending accounts and started showing ranges instead. Google also started to group synonyms and misspellings, which can sometimes cause discrepancies in the volumes.

Ahrefs commissioned a study that found that GKP overestimates search volume for more than 91% of their keywords and is "roughly accurate" 45% of the time.

This means that using GKP as your only research tool isn't as accurate as it was in the past.

5. Clickstream Data

Clickstream data is anonymized data gathered by tracking users across the internet.

This usually includes:

  • Unique device identification.
  • IP address.
  • Device type.
  • Operating system.
  • Country.
  • Language.
  • Timestamp.
  • Referral URL.
  • Time on page.

Clickstream data is then aggregated and used to construct models that help estimate monthly search volume for keywords.

By their very nature, head search terms get a lot more searches and so appear more often in clickstream data. This means it’s much easier to estimate the monthly search volume for these popular queries.

Long tail keywords appear less often, especially in a smaller sample of clickstream data, so it’s much harder to estimate the monthly search volume for these keywords.

This is why tools that rely on clickstream data are very accurate when estimating the search volume for popular head term keywords but less so at estimating for long tail keywords.
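The difference between head terms and long tail keywords can be shown with a simplified sketch. It scales the number of searches seen in a panel up to a whole population and attaches a rough relative error to the result. This is an illustration only: the function names, the panel and population sizes, and the choice to treat the count as a Poisson variable are all assumptions, and real clickstream models are considerably more involved.

```python
import math


def estimate_volume(panel_searches: int, panel_users: int, population: int) -> float:
    """Scale the panel's monthly search count up to the whole population."""
    return panel_searches * population / panel_users


def relative_error(panel_searches: int) -> float:
    """Approximate relative standard error, treating the count as Poisson."""
    if panel_searches <= 0:
        return math.inf
    return 1 / math.sqrt(panel_searches)


if __name__ == "__main__":
    # Hypothetical panel of 100,000 users standing in for 10 million searchers.
    panel_users, population = 100_000, 10_000_000
    # A head term seen 2,500 times and a long tail keyword seen 4 times.
    for label, seen in [("head term", 2500), ("long tail", 4)]:
        volume = estimate_volume(seen, panel_users, population)
        error = relative_error(seen)
        print(f"{label}: ~{volume:,.0f} searches/month, +/- {error:.0%}")
```

Under these assumptions, the fewer times a keyword is observed in the panel, the larger the relative error of its estimate, which is why long tail volumes are the shakiest.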

Clickstream data will also show where users start searching, what they click, how many times, and what the referral is before converting. According to Google, the user journey goes through a "messy middle", and having access to this data via clickstream sources allows tools to suggest which types of content to produce more of.

When it comes to collecting clickstream data, there are many ways to do it, and some are more transparent than others.

One of those is when users voluntarily sign up for a program that will track them. These users are commonly known as panelists.

Less ethical methods might include browser extensions and free VPN / Antivirus services tracking user data.

These tools are in a prime position to track a user's online activity without them knowing.

Most of them have access to all visited sites, meaning they can know about every page a user visits. That's why I'm not surprised when there are reports of browser extensions selling user data.

Fortunately, not all extensions that want to track users hide it in their terms and conditions. Some, like SimilarWeb, say clearly and upfront that a user needs to become a contributor to access their data.

That's how it should be: upfront disclosure of data collection and not burying it in terms of service.

6. Social Media & Online Forums

Social media sites and forums can be a goldmine of relevant keywords.

Crawling, scraping, and analyzing these can find topics and keywords with genuine interest from users - as they are writing about it publicly.

7. Google Trends

Google Trends is an interesting keyword source showing search interest from Google Search, News, Shopping, and YouTube.

There is data available ranging from the last hour all the way back to 2004!

Google Trends scores topic/keyword popularity from 0 to 100, where 100 means peak interest for that search within the selected time range and location.

Keyword tools can combine this data with Google Keyword Planner to better estimate search volume throughout the year.

This helps them understand a topic’s seasonality and determine whether interest is trending up or down.
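One simple way to picture combining the two sources is to spread an average monthly volume across the year in proportion to the Google Trends index. The sketch below does exactly that and nothing more. The function name and the sample numbers are hypothetical, and actual tools may blend these sources quite differently.

```python
def monthly_estimates(avg_monthly_volume: float, trends_index: list[float]) -> list[int]:
    """Spread an average monthly volume across months in proportion to interest."""
    if not trends_index or sum(trends_index) == 0:
        raise ValueError("trends_index needs at least one non-zero value")
    mean_index = sum(trends_index) / len(trends_index)
    # A month with twice the mean interest gets twice the average volume.
    return [round(avg_monthly_volume * value / mean_index) for value in trends_index]


if __name__ == "__main__":
    # Hypothetical inputs: an average of 1,000 searches/month from Keyword Planner
    # and twelve monthly Google Trends values (0 to 100) that peak in July.
    trends = [20, 20, 30, 40, 60, 80, 100, 80, 60, 40, 30, 40]
    for month, volume in enumerate(monthly_estimates(1000, trends), start=1):
        print(f"month {month:2d}: {volume}")
```

Because every value is divided by the mean index, the monthly estimates still average out to the original figure while showing when interest peaks and dips.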

Google Trends can also show related searches by country and the growth for that period.

Summary

In this article, we reviewed all the sources that keyword research tools use to get their suggestions and monthly search volume estimates. Premium tools tend to use a combination of sources to create their own monthly search volume models.

This helps explain why you see such differences in the keyword volume from one tool to another.