ON-DEMAND WEBINAR

Mastering data harmony: Techniques for matching and deduplication of scraped data

Learn the strategies for matching and deduplicating scraped data

In this workshop, Fernando delves into the complex issue of matching and deduplicating data as your web scraping projects extend across multiple data sources. Linking items across different domains, connecting products between e-commerce sites, matching real estate listings to public records, and correlating news stories between newspapers all pose significant challenges.


Learning how to aggregate this information efficiently is vital for building a resilient database that data scientists can mine for insights or that can be resold to other businesses.


This workshop covers the following:  


  • Recognising the importance and challenges of data matching and deduplication in web scraping projects.

  • Exploring various approaches to tackling this issue in your pipelines, from simple solutions such as sniffing unique IDs from within the HTML to complex strategies involving multimodal matching with text and image vector representations.

  • Creating robust databases using the matching and deduplication techniques learned.

  • Understanding the value of these databases to data scientists and other businesses.
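
To make the two ends of that spectrum concrete, here is a minimal sketch, not taken from the workshop itself. It assumes each scraped item is a dict with a raw `html` string and a precomputed `vector` (for example, a text or image embedding). The `data-product-id` attribute, the `sniff_id` and `deduplicate` helpers, and the 0.95 threshold are all hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical pattern: assumes the site exposes an ID in a data-product-id attribute.
ID_PATTERN = re.compile(r'data-product-id="([^"]+)"')


def sniff_id(html):
    """Return the embedded ID if the assumed attribute is present, else None."""
    match = ID_PATTERN.search(html)
    return match.group(1) if match else None


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(items, threshold=0.95):
    """Keep the first item per sniffed ID; items without an ID are dropped
    when their vector is at least `threshold` similar to an item already kept."""
    seen_ids = set()
    kept = []
    for item in items:
        item_id = sniff_id(item["html"])
        if item_id is not None:
            if item_id in seen_ids:
                continue  # exact duplicate by ID
            seen_ids.add(item_id)
        elif any(cosine(item["vector"], k["vector"]) >= threshold for k in kept):
            continue  # near-duplicate by vector similarity
        kept.append(item)
    return kept
```

The ID check is cheap and exact, so it runs first; the vector comparison is only a fallback for items where no ID could be found. Comparing every item against everything already kept is quadratic, so a real pipeline would likely swap in an approximate nearest-neighbour index.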


For any follow-up questions after watching the webinar, join us on Discord and engage directly with Fernando.

Join Our Discord Community


© Zyte Group Limited 2026