Arnold Alexander
In this workshop, Fernando delves into the complex issue of matching and deduplicating data as your web scraping projects extend across multiple data sources. Linking items across different domains, connecting products between e-commerce sites, matching real estate listings to public records, and correlating news stories between newspapers all pose significant challenges.
Learning how to aggregate this information efficiently is vital for building a resilient database that data scientists can mine for insights or that can be resold to other businesses.
This workshop covers the following:
Recognising the importance and challenges of data matching and deduplication in web scraping projects.
Exploring various approaches to tackle this issue in your pipelines, from simple solutions like sniffing unique IDs from within HTML to complex strategies involving multimodal matching using text and image vector representations.
Creating robust databases using the matching and deduplication techniques learned.
Understanding the value of these databases to data scientists and other businesses.
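To make the simpler end of that range concrete, here is a minimal sketch in Python, not the method presented in the workshop. It assumes each scraped item is a dict with hypothetical "html" and "title" keys, that pages expose a hypothetical data-sku attribute, and that a fuzzy title comparison (the standard library's difflib, standing in for the text and image vector matching mentioned above) is an acceptable fallback when no ID is found. The 0.85 threshold is an arbitrary illustration value.

```python
import re
from difflib import SequenceMatcher

# Assumption: pages embed an identifier as data-sku="..."; real sites vary.
SKU_PATTERN = re.compile(r'data-sku="([^"]+)"')


def extract_sku(html):
    """Return the first data-sku value found in the HTML, or None."""
    match = SKU_PATTERN.search(html)
    return match.group(1) if match else None


def normalize(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def is_same_item(a, b, threshold=0.85):
    """Compare by unique ID when both items have one, else by fuzzy title."""
    sku_a, sku_b = extract_sku(a["html"]), extract_sku(b["html"])
    if sku_a and sku_b:
        return sku_a == sku_b
    ratio = SequenceMatcher(
        None, normalize(a["title"]), normalize(b["title"])
    ).ratio()
    return ratio >= threshold


def deduplicate(items):
    """Keep the first occurrence of each item (quadratic; fine for a sketch)."""
    unique = []
    for item in items:
        if not any(is_same_item(item, kept) for kept in unique):
            unique.append(item)
    return unique
```

The pairwise comparison in deduplicate grows quadratically with the number of items, so a larger pipeline would likely index items by ID first and reserve fuzzy or vector-based matching for the records that have no ID.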
For any follow-up questions after watching the webinar, join us on Discord and engage directly with Fernando.