PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    SERP

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Browse

    • BlogArticles, podcasts, videos
    • Case studiesCustomer outcomes
    • White papersIn-depth reports
    • EventsConferences, webinars, recordings

    Subscribe

    • NewsletterSwiftly delivered
    • Discord communityExtract Data community
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
All articles
AI60, 60 articles
Data quality13, 13 articles
Developer interest57, 57 articles
Integration2, 2 articles
Open-source40, 40 articles
Proxies29, 29 articles
Scraping practice17, 17 articles
Scraping strategy26, 26 articles
Web data60, 60 articles
Web scraping APIs33, 33 articles
Zyte API59, 59 articles
Scrapy48, 48 articles
Scrapy Cloud10, 10 articles
Web Scraping Copilot12, 12 articles
AI & Machine Learning1, 1 articles
Automotive2, 2 articles
E-commerce & retail26, 26 articles
Entertainment & Streaming2, 2 articles
Financial Services8, 8 articles
Government2, 2 articles
Market Research & Intelligence3, 3 articles
Media & publishing8, 8 articles
Real Estate2, 2 articles
Recruitment & HR3, 3 articles
Transportation & Logistics2, 2 articles
Travel & hospitality2, 2 articles
Extract Summit25, 25 articles
PyCon1, 1 articles

Appearance

Discord Community
BlogHow ToLink Analysis Algorithms Explained
ArticleGuideHow To

Link Analysis Algorithms Explained

When scraping content from the web, you often crawl websites which you have no prior knowledge of. Link analysis algorithms are incredibly useful in these

V

Valdir Stumm Junior

6 min read · June 19, 2015

Link Analysis Algorithms Explained

Link analysis algorithms explained

When scraping content from the web, you often crawl websites which you have no prior knowledge of. Link analysis algorithms are incredibly useful in these scenarios to guide the crawler to relevant pages.

This post aims to provide a lightweight introduction to page ranking algorithms so you have a better understanding of how to implement and use them in your spiders. There will be a follow up post soon detailing how to use these algorithms in your crawl.

PageRank

PageRank is perhaps the most well known page ranking algorithm. The PageRank algorithm uses the link structure of the web to calculate a score. This score provides insight into how relevant the page is and in our case can be used to guide the crawler. Many search engines use this score to influence search results.

One possible way of coming up with PageRank is by modelling the surfing behavior of a user on the web graph. Imagine that at a given instant of time [LaTeX formula] the surfer, [LaTeX formula], is on page [LaTeX formula]. We denote this event as [LaTeX formula]. Page [LaTeX formula] has [LaTeX formula] outgoing and [LaTeX formula] incoming links as depicted in the figure.

web\_graph

While on page [LaTeX formula] the algorithm can do one of the following to continue navigating:

  1. With probability [LaTeX formula] follow randomly an outgoing link.
  2. With probability [LaTeX formula] select randomly an arbitrary page on the web. When this happens we say the the surfer has teleported to another page.

So what happens if a page has no outgoing links? It's reasonable to assume the user won't stick around, so it's assumed that the user will 'teleport', meaning they will visit another page through different means such as entering the address manually.

Now that we have a model of the user behavior let's calculate [LaTeX formula]: the probability of being at page [LaTeX formula] at time instant [LaTeX formula]. The total number of pages is [LaTeX formula].

[LaTeX formula]

Now, the probability of going from page [LaTeX formula] to page [LaTeX formula] is different depending on wether or not [LaTeX formula] links to [LaTeX formula] ([LaTeX formula]) or not ([LaTeX formula]).

If [LaTeX formula] then:

[LaTeX formula]

If [LaTeX formula] but [LaTeX formula] then the only possibility is that the user chooses to teleport and it lands on page [LaTeX formula].

[LaTeX formula]

If [LaTeX formula] and [LaTeX formula] then the only possibility is that the user teleports to page [LaTeX formula].

[LaTeX formula]

We have assumed uniform probabilities in two cases:

  • When following outgoing links we assume all links have equal probability of being visited.
  • When teleporting to another page we assume all pages have equal probability of being visited.

In the next section we'll remove this second assumption about teleporting in order to calculate personalized PageRank.

Using the formulas above, with some manipulation we obtain:

[LaTeX formula]

Finally, for convenience let's call [LaTeX formula]:

[LaTeX formula]

PageRank is defined as the limit of the above sequence:

[LaTeX formula]

In practice, however, PageRank is computed by iterating the above formula a finite number of times: either a fixed number or until changes in the PageRank score are low enough.

If we look at the formula for [LaTeX formula] we see that the PageRank of a page has two parts. One part depends on how many pages are linking to the page but the other part is distributed equally to all pages. This means that all pages are going to get at least:

[LaTeX formula]

This gives an oportunity to link spammers to artificially increase the PageRank of any page they want by maintaing link farms, which are huge amounts of pages controlled by the spammer.

As PageRank will give all pages a minimum score, all of these pages will have some PageRank that can be redirected to the page the spammer wants to rise in search results.

Spammers will try to build backlinks to their pages by linking to their sites on pages they don't own. This is most common on blog comments and forums where content is accepted from users.

web\_spam

Trying to detect web spam is a never-ending war between search engines and spammers. To help filter out spam we can use Personalized PageRank which works by not assigning a free score to undeserving pages.

Personalized PageRank

Personalized PageRank is obtained very similar to PageRank but instead of a uniform teleporting probability, each page has its own probability [LaTeX formula] of being teleported to irrespective of the originating page:

[LaTeX formula]

The update equations are therefore:
[LaTeX formula]

Of course it must be that:

[LaTeX formula]

As you can see plain PageRank is just a special case where [LaTeX formula].

There are several ways the score [LaTeX formula] can be calculated. For example, it could be computed using some text classification algorithm on the page content. Alternatively, it could be set to 1 for some set of seeds pages and 0 for the rest of pages, in which case we get TrustRank.

Of course, there are ways to defeat this algorithm:

  • Good pages can link to spam pages.
  • Spam pages could manage to get good scores, for example, adding certain keywords to its content (content spamming).
  • Link farms can be improved by duplicating good pages but altering their links. An example would be mirrors of Wikipedia which add links to spam pages.

HITS

HITS (hyperlink-induced topic search) is another link analysis algorithm that assigns two scores: hub score and authority score. A page’s hub score is influenced by the authority scores of the pages linking to it, and vice versa. Twitter makes use of HITS to suggest users to follow.

The idea is to compute for each page a pair of numbers called the hub and authority scores. A page is considered a hub when it points to lot of pages with high authority, and page has high authority if it's pointed to by many hubs.

The following graph shows one several pages with one clear hub [LaTeX formula] and two clear authorities [LaTeX formula] and [LaTeX formula].

hits\_graph

Mathematically this is expressed as:

[LaTeX formula]
[LaTeX formula]

Where [LaTeX formula] represents the hub score of page [LaTeX formula] and [LaTeX formula] represents its authority score.

Similar to PageRank, these equations are solved iteratively until they converge to the required precision. HITS was conceived as a ranking algorithm for user queries where the set of pages that were not relevant to the query were filtered out before computing HITS scores.

For the purposes of our crawler we make a compromise: authority scores are modulated with the topic specific score [LaTeX formula] to give the following modified equations:

[LaTeX formula]
[LaTeX formula]

As we can see totally irrelevant pages ([LaTeX formula]) don't contribute back authority.

HITS is slightly more expensive to run than PageRank because it has to maintains two sets of scores and also propagates scores twice. However, it's particularly useful for crawling as it propagates scores back to the parent pages, providing a more accurate prediction of the strength of a link.

Further reading:

  • HITS: Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46.5 (1999): 604-632.
  • PageRank and Personalized Pagerank: Page, Lawrence, et al. "The PageRank citation ranking: Bringing order to the web." (1999).
  • TrustRank: Gyöngyi, Zoltán, Hector Garcia-Molina, and Jan Pedersen. "Combating web spam with trustrank." Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.

We hope this post has served as a good introduction to page ranking algorithms and their usefulness in web scraping projects. Stay tuned for our next post where we'll show you how to use these algorithms in your Scrapy projects!

Try Zyte API

Build your first scraper in minutes

Free trial, no credit card. From a single request to production in an afternoon.

Get started
How To
V

Valdir Stumm Junior

More from this author

In this article

  • PageRank
  • Personalized PageRank
  • HITS

Follow

Get the latest

Zyte and the data web in your inbox — or wherever you already are.

Subscribe

Or follow elsewhere

Continue reading

Teaching AI to scrape like a pro: how we measure LLMs’ data quality
How To

Teaching AI to scrape like a pro: how we measure LLMs’ data quality

AI-enabled code editors can now conjure scraping code on command. But is it any good? Here’s how Zyte re-engineered LLMs with Web Scraping Copilot to drive best-in-class output.

Theresia Tanzil·10 min·February 23, 2026
Analyze web data quickly with Jupyter Notebooks and Zyte API
How To

Analyze web data quickly with Jupyter Notebooks and Zyte API

With AI Scraping in Zyte API, you can pull data from any e-commerce website straight into your Jupyter notebooks.

Neha Setia Nagpal·2 mins·December 13, 2024
Overcoming web scraping challenges of Puppeteer and Playwright
How To

Overcoming web scraping challenges of Puppeteer and Playwright

Discover the challenges of scaling web scraping with Playwright & Puppeteer, from browser farm management to IP rotation and anti-scraping tactics.

Neha Setia Nagpal·1 mins·December 5, 2024

The Community · Newsletter

The best of Zyte and the data web, in your inbox.

One curated edition — new articles, product updates, and the stories shaping the data web. No noise.

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026