You Can’t Compare Backlink Counts in SEO Tools: Here’s Why


Google knows about 300T pages on the web. It’s doubtful they crawl all of those, and at least according to some documents from their antitrust trial we learned they only indexed 400B. That’s about 0.133% of the pages they know about, roughly 1 out of every 750 pages.

For Ahrefs, we choose to store about 340B pages in our index as of December 2023.

At a certain point, the quality of the web becomes bad. There are lots of spam and junk pages that just add noise to the data without adding any value to the index.

Large parts of the web are also duplicate content, ~60% according to Google’s Gary Illyes. Most of this is technical duplication caused by different systems. However, if you don’t account for this duplication, it can waste more resources and create more noise in the data.

When building an index of the web, companies have to make many choices about crawling, parsing, and indexing data. While there’s going to be a lot of overlap between indexes, there are also going to be some differences depending on each company’s decisions.

Comparing link indexes is hard because of all the different choices the various tools have made. I try my best to make some comparisons more fair, but even for a few sites I’m telling you that I don’t want to put in all of the work needed to make an accurate comparison, much less do it for an entire study. You’ll see why I say this later when you read what it would take to compare the data accurately.

However, I did run some tests on a sample of sites and I’ll show you how to check the data yourself. I also pulled some fairly large third-party data samples for some additional validation.

Let’s dive in.

Numbers often include different data

If you just looked at dashboard numbers for links and RDs in different tools you might see completely different things.

For example, here’s what we count in Ahrefs:

  • Live links
  • Live RDs
  • 6 months of data

In Semrush, here’s what they count:

  • Live + dead links
  • Live + dead RDs
  • 6 months of data + a bit more*

*By a bit more, what I mean is that their data goes back 6 months and to the start of the previous month. So, for instance, if it’s the 15th of the month, they would actually have about 6.5 months of data instead of 6 months of data. If it’s the last week of the month, they may have close to 7 months of data instead of 6.
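To see how much the extra days add, here is a minimal Python sketch. It assumes my reading of the window is right: the start date is the first day of the month six calendar months before the current one. The function names are made up for illustration and don't come from either tool.

```python
from datetime import date


def window_start(today: date, months_back: int = 6) -> date:
    """First day of the month that is `months_back` months before today's month.

    Assumption: this mirrors the "6 months plus back to the start of that
    month" behavior described above. It is not a documented formula.
    """
    # Count months from year 0 so the subtraction wraps across year boundaries.
    month_index = today.year * 12 + (today.month - 1) - months_back
    return date(month_index // 12, month_index % 12 + 1, 1)


def window_days(today: date, months_back: int = 6) -> int:
    """Number of days covered by the window ending today."""
    return (today - window_start(today, months_back)).days


if __name__ == "__main__":
    for d in (date(2024, 6, 1), date(2024, 6, 15), date(2024, 6, 28)):
        print(d, "window starts", window_start(d), "and spans", window_days(d), "days")
```

The later in the month you look, the longer the window gets, which is why the same site can show a bigger number at the end of a month than at the start.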

This may not seem like a lot, but it can increase the numbers shown by a lot, especially when you’re still counting dead links and dead RDs.

I don’t think SEOs want to see a number that includes dead links. I don’t see a good reason to count them, either, other than to have bigger and potentially misleading numbers.

I only say this because I’ve called Semrush out on making this kind of biased comparison before on Twitter, but I stopped arguing when I realized that they really didn’t want the comparison to be fair; they just wanted to win the comparison.

But you are drawing conclusions by literally tweeting who wins based on a bad comparison. That’s not the same as “allowing everyone to make their own conclusions” it’s just misleading people who don’t know there’s a difference in the data being compared.

— Patrick Stox (@patrickstox) April 15, 2021

A more accurate, but still not accurate way to compare links

There are some ways you can compare the data to get somewhat similar time periods and only look at active links.

If you filter the Semrush backlinks report for “Active” links, you’ll have a somewhat more accurate number to compare against the Ahrefs dashboard number.

Alternatively, if you use the “Show history: Last 6 months” option in the Ahrefs backlink report, this would include lost links and be a fairer comparison to Semrush’s dashboard number.

Here’s an example of how to get more similar data:

  • Semrush Dashboard: 5.1K = Ahrefs (6-month date comparison): 5.6K
  • Semrush All Links: 5.1K = Ahrefs (6-month date comparison): 5.6K
  • Semrush Active Links: 2.9K = Ahrefs Dashboard: 3.5K = Ahrefs (no date comparison): 3.5K

What you should not compare is Semrush Dashboard and Ahrefs Dashboard numbers. The number in Semrush (5.1K) includes dead links. The number in Ahrefs (3.5K) doesn’t; it’s only live links!

Note that the time periods may not be exactly the same as mentioned earlier because of the extra days in the Semrush data. You could look at what day their data stops and select that exact day in the Ahrefs data to get an even more accurate, but still not quite accurate comparison.

I don’t think the comparison works at all with larger domains because of an issue in Semrush. Here’s what I saw for semrush.com:

  • Semrush Dashboard: 48.7M = Ahrefs (6-month date comparison): 24.7M
  • Semrush All Links: 48.7M = Ahrefs (6-month date comparison): 24.7M
  • Semrush Active Links: 1.8M = Ahrefs Dashboard: 15.9M = Ahrefs (no date comparison): 15.9M

So that’s 1.8M active links in Semrush vs 15.9M active in Ahrefs. But as I said, I don’t think this is a fair comparison. Semrush seems to have an issue with larger sites. There is a warning in Semrush that says, “Due to the size of the analyzed domain, only the most relevant links will be shown.” It’s possible they’re not showing all the links, but this is suspicious because they will show the total for all links, which is a larger number, and I can filter those in other ways.

I can also sort normally by the oldest last seen date and see all the links, but when I do last seen + active, I see only 608K links. I can’t get more than 50k rows in their system to investigate this further, but something is fishy here.

More link differences

The above comparison wouldn’t be enough to make an accurate comparison. There are still a number of differences and problems that make any kind of comparison troublesome.

This tweet is as relevant as the day I wrote it:

If a tool wants to win link data comparisons they can just count more things like subdomains as referring domains, count dead links, count links more than once, etc. There needs to be more transparency which is why this exists. Quality of data matters. https://t.co/5GGaEjbzW8

— Patrick Stox (@patrickstox) January 27, 2021

It’s almost impossible to do a fair link comparison

Here’s how we count links, but it’s worth mentioning that each tool counts links in different ways.

To recap some of the main points, here are some things we do:

  • We store some links inserted with JavaScript; no one else does this. We render ~250M pages a day.
  • We have a canonicalization system in place that others may not, which means we shouldn’t count as many duplicates as others do.
  • Our crawler tries to be intelligent about what to prioritize for crawling to avoid spam and things like infinite crawl paths.
  • We count one link per page; others may count multiple links per page.

These differences make a fair link comparison almost impossible to do.

How to see where the biggest link differences are

The easiest way to see the biggest discrepancies in link totals is to go to the Referring Domains reports in the tools and sort by the number of links. You can use the dropdowns to see what kinds of issues each index may have with overcounting some links. In many cases, you’re likely to see millions of links from the same site for some of the reasons mentioned above.

For example, when I looked in Semrush I found blogspot links that they claimed to have recently checked, but these are showing 404 when I visit them. Semrush still counts them for some reason. I saw this issue on multiple domains I checked. This is one of those pages:

Semrush counting links on 404 pages

Lots of links counted as live are actually dead

Seeing the dead link above counted in the total made me want to check how many dead links were in each index. I ran crawls on the list of the most recent live links in each tool to see how many were actually still live.

For Semrush, 49.6% of the links they said were live were actually dead. Some churn is expected as the web changes, but half the links in 6 months indicates that a lot of these may be on the spammier part of the web that isn’t as stable or they’re not re-crawling the links often. For some context, the same number for Ahrefs came back as 17.2% dead.
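If you want to run a similar check on your own export, the tally can be sketched like this in Python. It assumes you have already re-crawled the exported "live" links and recorded one HTTP status per URL (or `None` when the request failed). The rule that any status of 400 or higher counts as dead is my simplification, not necessarily how the crawl above was scored.

```python
from typing import Mapping, Optional


def is_dead(status: Optional[int]) -> bool:
    """Treat failed requests and 4xx/5xx responses as dead (an assumption)."""
    return status is None or status >= 400


def dead_share(results: Mapping[str, Optional[int]]) -> float:
    """Fraction of re-crawled linking pages that came back dead."""
    if not results:
        raise ValueError("no crawl results to score")
    dead = sum(1 for status in results.values() if is_dead(status))
    return dead / len(results)


if __name__ == "__main__":
    # Hypothetical crawl results: linking page URL -> status code.
    sample = {
        "https://example.com/post-1": 200,
        "https://example.com/post-2": 404,
        "https://example.org/old-page": 410,
        "https://example.net/timeout": None,
    }
    print(f"{dead_share(sample):.1%} of links marked live were dead")
```

A page that returns 200 can still have dropped the link, so a stricter version would also parse the page and confirm the link is present. This sketch will therefore undercount dead links rather than overcount them.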

It’s going to get more complicated to compare these numbers

Ahrefs recently added a filter for “Best links” which you can configure to filter out noise. For instance, if you want to remove all blogspot.com blogs from the report, you can add a filter for it.

Ahrefs' Best links filter

This means you’ll only see links you consider important in the reports. This can also be applied to the main dashboard numbers and charts now. If the filter is active, people will see different numbers depending on their settings.

You would think this is straightforward, but it’s not.

Solving for all the issues is a lot of work

There are a lot of different things you’d have to solve for here:

  • The extra days in Semrush’s data that you’ll have to remove or add to the Ahrefs number.
  • Remember that Semrush also includes dead RDs in their dashboard numbers. So you need to filter their RD report to just “Active” to get the live ones.
  • Remember that half the links in the test of Semrush live data were actually dead, so I would suspect that a number of the RDs are actually lost as well. You could potentially look for domains with low link counts and just crawl the listed links from those to remove most of the dead ones.
  • After all that, you’re still going to need to strip the domains down to the root domain only to account for the differences in what each tool may be counting as a domain.
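The last step in that list, stripping hosts down to the root domain before counting, could look roughly like this in Python. This is only a sketch. The `MULTI_LABEL_SUFFIXES` set is a tiny hypothetical stand-in for the full Public Suffix List, which a real comparison would need, and the function names are mine.

```python
from typing import Iterable

# Hypothetical, deliberately incomplete stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def root_domain(host: str) -> str:
    """Reduce a hostname to its root domain using the tiny suffix set above."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_LABEL_SUFFIXES:
        # Suffix spans two labels, so keep three labels (example.co.uk).
        return ".".join(labels[-3:])
    return last_two


def count_root_domains(hosts: Iterable[str]) -> int:
    """Count unique root domains in a referring-hosts export."""
    return len({root_domain(h) for h in hosts})


if __name__ == "__main__":
    hosts = ["www.example.com", "blog.example.com", "shop.example.co.uk", "example.co.uk"]
    print(count_root_domains(hosts))  # four hosts collapse to two root domains
```

Running both tools' exports through the same function at least makes sure "domain" means the same thing on both sides before you compare totals.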

What is a domain?

Ahrefs currently shows 206.3M RDs in our database and Semrush shows 1.6B. Domains are being counted in extremely different ways between the tools.

Ahrefs has 340B pages and 206M domains in the index

According to the major sources who look at these kinds of things, the number of domains on the internet seems to be between 269M-359M and the number of websites between 1.1B-1.5B, with 191M-200M of them being active.

Semrush’s number of RDs is higher than the number of domains that exist.

I believe Semrush may be confusing different terms. Their numbers match fairly closely with the number of websites on the internet, but that’s not the same as the number of domains. Plus, many of those websites aren’t even live.

It’s going to get more complicated to compare these numbers

Part of our process is dropping spam domains, and we also treat some subdomains as different domains. We come up close to the numbers from other third-party studies for the number of active websites and domains, whereas Semrush seems to come in closer to the total number of websites (including inactive ones).

We’re going to simplify our methodology soon so that one domain is actually just one domain. This is going to make our RD numbers go down, but be more accurate to what people actually consider a domain. It’s also going to make for an even bigger disparity in the numbers between the tools.

Data freshness / Update speed

I ran some quality checks for both the first-seen and last-seen link data. On every site I checked, Ahrefs picked up more links first and updated the links more recently than Semrush. Don’t just believe me, though; check for yourself.

Comparing this is biased no matter how you look at it because our data is more granular and includes the hours and minutes instead of just the day. Leaving the hours and minutes creates a biased comparison, and so does removing it. You’ll have to match the URLs and check which date is first or if there is a tie and then count the totals. There will be some different links in each dataset, so you’ll need to do the lookups on each set of data for comparison.
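The matching step described above can be sketched in Python. This version assumes you have loaded each tool's export into a dictionary of URL to first-seen timestamp, and it truncates both sides to the day so that same-day discoveries count as ties. That truncation is one of the two imperfect choices mentioned above, and the function and key names are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict


def first_seen_tally(tool_a: Dict[str, datetime], tool_b: Dict[str, datetime]) -> Counter:
    """Count which tool saw each shared URL first, comparing at day granularity."""
    tally: Counter = Counter()
    for url in tool_a.keys() & tool_b.keys():
        day_a, day_b = tool_a[url].date(), tool_b[url].date()
        if day_a < day_b:
            tally["a_first"] += 1
        elif day_b < day_a:
            tally["b_first"] += 1
        else:
            tally["tie"] += 1
    # Links found by only one tool can't be compared, so report them separately.
    tally["only_a"] = len(tool_a.keys() - tool_b.keys())
    tally["only_b"] = len(tool_b.keys() - tool_a.keys())
    return tally


if __name__ == "__main__":
    a = {"u1": datetime(2024, 3, 1, 9, 30), "u2": datetime(2024, 3, 5), "u3": datetime(2024, 3, 2, 23, 0)}
    b = {"u1": datetime(2024, 3, 3), "u2": datetime(2024, 3, 4), "u3": datetime(2024, 3, 2)}
    print(dict(first_seen_tally(a, b)))
```

The same function works for last-seen dates if you flip which side you treat as the winner, since a later date is better there.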

Semrush claims, “We update the backlinks data in the interface every 15 minutes.”

Ahrefs claims, “The world’s largest index of live backlinks, updated with fresh data every 15–30 minutes.”

I pulled data at the same time from both tools to see when the latest links for some popular websites were found. Here’s a summary table:

Domain         Ahrefs latest     Semrush latest
semrush.com    3 minutes ago     7 days ago
ahrefs.com     2 minutes ago     5 days ago
hubspot.com    0 minutes ago     9 days ago
foxnews.com    1 minute ago      12 days ago
cnn.com        0 minutes ago     13 days ago
amazon.com     0 minutes ago     6 days ago

That doesn’t look fresh at all. Their 15-minute update claim seems pretty dubious to me with so many websites not having updates for many days.

Don’t just trust me, though; I encourage you to check some websites yourself. Go into the backlinks reports in both tools and sort by last seen. Be sure to share your results on social media.

Ahrefs now receives data from IndexNow

This will make our data even fresher. That’s ~2.5B URLs / day in March 2024. The websites tell us about new pages, deleted pages, or any changes they make so that we can go crawl them and update the data. Read more here.

Ahrefs crawls 7B+ pages every day. Semrush claims they crawl 25B pages per day. This would be ~3.5x what Ahrefs crawls per day. The problem is that I can’t find any evidence that they crawl that fast.

We saw that about half the links that Semrush had marked as active were actually dead compared to about 17% in Ahrefs, which indicated to me that they may not re-crawl links as often. That and the freshness test both pointed to them crawling slower. I decided to look into it.

Logs of my sites

I checked the logs of some of my sites and sites I have access to, and I didn’t see anything to support the claim that Semrush crawls faster. If you have access to logs of your own site, you should be able to check which bots are crawling the fastest.
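A quick way to do that check is to count access-log lines per crawler. Here is a minimal Python sketch that matches on user-agent substrings. The token list is an assumption you should verify against each bot's documentation, and user agents can be spoofed, so treat the counts as a rough signal.

```python
from collections import Counter
from typing import Iterable

# Assumed user-agent tokens; confirm them against each crawler's published docs.
BOT_TOKENS = ("AhrefsBot", "SemrushBot", "Googlebot", "bingbot")


def count_bot_hits(lines: Iterable[str]) -> Counter:
    """Count log lines per bot by case-insensitive user-agent substring match."""
    hits: Counter = Counter()
    for line in lines:
        lowered = line.lower()
        for token in BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to at most one bot
    return hits


if __name__ == "__main__":
    # Hypothetical log lines in combined log format.
    sample = [
        '1.2.3.4 - - [01/Mar/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
        '1.2.3.5 - - [01/Mar/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
        '1.2.3.6 - - [01/Mar/2024:00:00:03 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/122.0"',
    ]
    for bot, n in count_bot_hits(sample).most_common():
        print(bot, n)
```

To run it on a real file, pass an open file handle, for example `count_bot_hits(open("access.log", encoding="utf-8", errors="replace"))`, where the path is whatever your server writes to.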

80,000 months of log data

I was curious and wanted to look at bigger samples. I used Web Explorer and a few different footprints (patterns) to find log file summaries produced by AWStats and Webalizer. These are often published on the web.

Web Explorer search I used to find log files on the web

I scraped and parsed ~80,000 log file summaries that contained 1 month of data each and were generated in the last couple of years. This sample contained over 9k websites in total.

I did not see evidence of Semrush crawling many times faster than Ahrefs for these sites, as they claim they do. The only bot that was crawling much faster than Ahrefsbot in this dataset was Googlebot. Even other search engines were behind our crawl rate.

That’s just data from a small-ish number of sites compared to the scale of the web. What about for a larger chunk of the web?

Data from 20%+ of web traffic

At the time of writing, Cloudflare Radar has Ahrefsbot as the #7 most active bot on the web and Semrushbot at #40.

While this isn’t a complete picture of the web, it’s a fairly large chunk. In 2021, Cloudflare was said to manage ~20% of the web’s traffic, up from ~10% in 2018. It’s likely much higher now with that kind of growth. I couldn’t find the numbers from 2021, but in early 2022 they were handling 32 million HTTP requests / second on average and in early 2023 they had already grown to handling 45 million HTTP requests / second on average, over 40% more in one year!

Additionally, ~80% of websites that use a CDN use Cloudflare. They handle many of the larger sites on the web; BuiltWith shows that Cloudflare is used by ~32% of the Top 1M websites. That’s a significant sample size and likely the largest sample that exists.

How much do SEO tools crawl?

Some of the SEO tools share the number of pages they crawl on their websites. The only one in the table below that doesn’t have a publicly published crawl rate is AhrefsSiteAudit bot, but I asked our team to pull the info for this. Let me put the rankings in perspective with actual and claimed crawl rates.

Ranking    Bot                Crawl rate
7          Ahrefsbot          7B+ / day
27         DataForSEO Bot     2B / day
29         AhrefsSiteAudit    600M - 700M / day
35         Botify             143.3M / day
40         Semrushbot         25B / day* claimed

The math isn’t mathing. How can Semrush claim they’re crawling multiple times as fast as these others, but their ranking is lower? Cloudflare doesn’t cover the entire web, but it’s a large chunk of the web and a more than representative sample size.
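To make the mismatch concrete, here is a small Python sketch that flags every pair of bots where the one ranked lower on Cloudflare Radar reports a higher crawl rate. The figures are the ones from the table above, with the lower bound (600M) used for AhrefsSiteAudit. Radar ranks requests seen on Cloudflare rather than pages per day, so this is a sanity check and not a measurement.

```python
from typing import List, Tuple

# (Cloudflare Radar rank, bot, stated pages per day) taken from the table above.
ROWS: List[Tuple[int, str, float]] = [
    (7, "Ahrefsbot", 7e9),
    (27, "DataForSEO Bot", 2e9),
    (29, "AhrefsSiteAudit", 600e6),  # lower bound of the 600M - 700M range
    (35, "Botify", 143.3e6),
    (40, "Semrushbot", 25e9),  # claimed
]


def rank_inversions(rows: List[Tuple[int, str, float]]) -> List[Tuple[str, str]]:
    """Return (lower_ranked_bot, higher_ranked_bot) pairs where the lower-ranked
    bot states a higher crawl rate than the higher-ranked one."""
    inversions = []
    for rank_a, bot_a, rate_a in rows:
        for rank_b, bot_b, rate_b in rows:
            if rank_a < rank_b and rate_a < rate_b:
                inversions.append((bot_b, bot_a))
    return inversions


if __name__ == "__main__":
    for lower, higher in rank_inversions(ROWS):
        print(f"{lower} states a higher rate than {higher} but ranks below it")
```

With these numbers, every inversion involves Semrushbot, and the other four bots are ordered consistently with their stated rates.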

When they originally made this 25B claim, I believe they were closer to 90th on Cloudflare Radar, near the bottom of the list at the time. Semrush hasn’t updated this number since then, and I recall a period of time where they were in the 60s-70s on Cloudflare Radar as well. They do seem to be getting faster, but their claimed numbers still don’t add up.

I don’t hear SEOs raving about Moz or Sistrix having the best link data, but they are 21st and 36th on the list respectively. Both are higher than Semrush.

Possible explanations of differences

Semrush may be conflating the term pages with links, which is actually mentioned in some of their documentation. I don’t want to link to it, but you can find it with this quote: “Daily, our bot crawls over 25 billion links”. But links are not the same thing as pages and there can be hundreds of links on a single page.

It’s also possible they’re crawling a portion of the web that’s just more spammy and isn’t reflected in the data from either of the sources I looked at. Some of the numbers indicate this may be the case.

Y’all shouldn’t trust studies done by a specific vendor when it compares them to others, even this one. I try to be as fair as I can be and follow the data, but since I work at Ahrefs you can hardly consider me unbiased. Go look at the data yourselves and run your own tests.

There are some folks in the SEO community who try to do these tests every once in a while. The last major third-party study was run by Matthew Woodward, who initially declared Semrush the winner, but the verdict was changed and Ahrefs was eventually declared to be the rightful winner. What happened?

The methodology chosen for the study heavily favored Semrush and was investigated by a friend of mine, Russ Jones, may he rest in peace. Here’s what Russ had to say about it:

While services like Majestic and Ahrefs likely store a single canonical IP address per domain, SEMRush seems to store per link, which accounts for why there would be more IPs than referring domains in some cases. I do not think SEMRush is intentionally inflating their numbers, I think they are storing the data in a different way than competitors which results in a number that is higher and potentially misleading, but not due to ill intent.

The response from Matthew indicated that Semrush might have misled him in their favor. Here’s that comment:

Comment from Matthew Woodward in response to Semrush about the test.

In the end, Ahrefs won.

Check our current stats on our big data page.

Hardware listed on the Ahrefs big data page

While Semrush doesn’t provide current hardware stats, they did provide some in the past when they made changes to their link index.

In June 2019, they made an announcement that claimed they had the biggest index. The test from Matthew Woodward that I talked about happened after this announcement, and as you saw, Ahrefs won that.

In June 2021, they made another announcement about their link index that claimed they were the biggest, fastest, and best.

These are some stats they released at the time:

  • 500 servers
  • 16,128 cpu cores
  • 245 TB of memory
  • 13.9 PB of storage
  • 25B+ pages / day
  • 43.8T links

The release said they increased storage, but their previous release said they had 4000 PBs of storage. They said the storage was 4x, so I guess the previous number was supposed to be 4000 TBs and not 4000 PBs, and they just got mixed up on the terminology.

I checked our numbers at the time, and this is how we matched up:

  • 2400 servers (~5x greater)
  • 200,000 cpu cores (~12.5x greater)
  • 900 TB of memory (~4x greater)
  • 120 PB of storage (~9x greater)
  • 7B pages / day (~3.5x less???)
  • 2.8T live links (I’m not sure the total size, but to this day it’s not as big as the number they claimed)
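The rounded multipliers in that list can be rechecked with a few lines of Python. The figures are copied from the two lists above, and the dictionary keys are just labels I picked for the sketch.

```python
# Stats as listed above (Semrush's June 2021 release vs. Ahrefs at the time).
SEMRUSH = {"servers": 500, "cpu_cores": 16_128, "memory_tb": 245, "storage_pb": 13.9, "pages_per_day_b": 25}
AHREFS = {"servers": 2_400, "cpu_cores": 200_000, "memory_tb": 900, "storage_pb": 120, "pages_per_day_b": 7}


def ratios(ours: dict, theirs: dict) -> dict:
    """Ratio of each Ahrefs figure to the matching Semrush figure."""
    return {key: ours[key] / theirs[key] for key in theirs}


if __name__ == "__main__":
    for key, value in ratios(AHREFS, SEMRUSH).items():
        print(f"{key}: {value:.2f}x")
```

The exact ratios land near the rounded figures in the list: about 4.8x the servers, 12.4x the cores, 3.7x the memory, and 8.6x the storage, against 0.28x the claimed pages per day (25B is ~3.6x 7B).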

They were claiming more links and faster crawling with much less storage and hardware. Granted, we don’t know the details of the hardware, but we don’t run on dated tech.

They claimed to store more links than we have even now and in less space than we add to our system each month. It really doesn’t make sense.

Final thoughts

Don’t blindly trust the numbers on the dashboards or the general numbers because they may represent completely different things. While there’s no perfect way to compare the data between different tools, you can run many of the checks I showed to try to compare similar things and clean up the data. If something looks off, ask the tool vendors for an explanation.

If there ever comes a time when we stop winning on things like tech and crawl speed, go ahead and switch to another tool and stop paying us. But until that time, I’d be highly skeptical of any claims by other tools.

If you have questions, message me on X.
