A Practical Guide To Web Data QA Part IV

Here comes the 4th part of our web data quality assurance series. Learn about semi-automated techniques, methods and tools from the experts.

Ivan Ivanov, Warley Lopes

7 min read · September 3, 2020

A practical guide to web data QA part IV: Complementing semi-automated techniques

If you haven’t read the earlier installments, start with the first, second and third parts of the series.

In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.

1. Text editor search-and-replace techniques

Let’s say that the data we work with is separated by commas and line breaks:

change,deep-thoughts,thinking,world
abilities,choices

However, the number of comma-separated words per line is inconsistent. To make the data easier to check, we can transform the data set so that there is only one word on each line:

change
deep-thoughts
thinking
world
abilities
choices

To make this transformation, we can use the search-and-replace feature of code editors such as Sublime Text, Notepad++ or Visual Studio Code.

This is how you can do it using Sublime Text:

  1. Open the data file or paste it into a new tab in the program
  2. Open Search and Replace dialog by using the shortcut CTRL + H
    1. Enable Regex through the button “.*” in the bottom left corner
    2. Search for line breaks using the control character \n
    3. Replace with a comma (“,”)
  3. Click on “Replace All”

Once the process finishes, all the words will be on a single line, separated by commas.

Finally, we run a second search and replace, this time replacing every comma with \n (newline).

Once the replacement is done, we have a normalized dataset with only one word per line.
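For files too large to handle comfortably in an editor, the same normalization can be sketched in a few lines of Python. This is a minimal illustration rather than part of the editor workflow above; the function name `one_word_per_line` and the sample string are made up for the example.

```python
import re


def one_word_per_line(text):
    """Split on commas and line breaks, then emit one word per line."""
    # Treat any run of commas and/or line breaks as a single separator.
    pieces = re.split(r"[,\r\n]+", text)
    # Trim whitespace and drop empty pieces (e.g. from a trailing newline).
    words = [piece.strip() for piece in pieces if piece.strip()]
    return "\n".join(words)


# Sample input copied from the example data above.
raw = "change,deep-thoughts,thinking,world\nabilities,choices\n"
print(one_word_per_line(raw))
```

Unlike the two-step editor approach, this splits on both separators in a single pass, so the intermediate single-line form is never needed.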

2. Approaches with spreadsheets

Let's work again with data from http://quotes.toscrape.com/. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data.

After scraping the data from the page, we load it into a spreadsheet. We will be using Google Sheets for this example.

The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:

  • Select the tags column
  • Click on the Data menu option
  • Then Split text to columns
  • For the Separator window: choose “Comma”

The next step will be to convert the multiple tag columns into a single one. This can be done by defining a new range:

  • Select the range with tags (e.g., D2:K101)
  • Click on the Data menu option → Named ranges
  • Enter a name, e.g., “Tags”

Then apply the following formula and drag it down to the expected total length (e.g., 800 rows):

=INDEX(Tags,1+INT((ROW(A1)-1)/COLUMNS(Tags)),MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1)
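The formula walks the named range row by row: as it is dragged down, ROW(A1) becomes 1, 2, 3 and so on, and each value maps to one cell of the range. The example range D2:K101 spans 8 columns and 100 rows, which is where the 800-row total comes from. The index arithmetic can be sketched in Python as follows; `cell_for` and `flatten` are illustrative names, not spreadsheet functions.

```python
def cell_for(k, n_cols):
    """Return the 1-based (row, column) inside the range for output row k."""
    row = 1 + (k - 1) // n_cols           # 1+INT((ROW(A1)-1)/COLUMNS(Tags))
    col = (k - 1 + n_cols) % n_cols + 1   # MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1
    return row, col


def flatten(tags_range):
    """Flatten a rectangular range (list of rows) into one column, row by row."""
    n_cols = len(tags_range[0])
    total = len(tags_range) * n_cols
    flat = []
    for k in range(1, total + 1):
        row, col = cell_for(k, n_cols)
        flat.append(tags_range[row - 1][col - 1])
    return flat


# Hypothetical 2x3 range; an empty string stands for an empty cell.
sample = [["change", "deep-thoughts", "thinking"], ["abilities", "choices", ""]]
print(flatten(sample))
```

Empty cells in the range come through as empty entries, so they will also appear in the single column.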

Next, create a new column next to the tags populated with 1 (this will be used for counting).

The final step is to create a Pivot table:

  • Select the entire range (tags + count column)
  • Data → Pivot Table → Create
  • Rows: add the Tags column
  • Values: add the Count column (summarized by SUM)

Sort the pivot table by count descending to verify the top tags match the website.
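As a cross-check on the pivot table, the same count can be sketched with Python's `collections.Counter`. The input here is a short, made-up list of tag cells, not the real scraped data, and `top_tags` is a hypothetical helper name.

```python
from collections import Counter


def top_tags(tag_cells, n=10):
    """Count individual tags across comma-separated cells; return the n most common."""
    counts = Counter()
    for cell in tag_cells:
        for tag in cell.split(","):
            tag = tag.strip()
            if tag:  # skip empty pieces left by stray commas
                counts[tag] += 1
    return counts.most_common(n)


# Hypothetical sample: each string is the tags column of one scraped quote.
scraped = [
    "change,deep-thoughts,thinking,world",
    "abilities,choices",
    "change,world",
    "change",
]
print(top_tags(scraped, n=2))
```

If the list produced this way disagrees with the Top Ten tags shown on the website, either the scrape missed quotes or the tags were split incorrectly.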

3. Manually checking multiple pages

The top tags were verified in the previous example. What if we need to open the URLs of the top 10 tags and visually check them? In this case, we can use a combination of tools such as:

  • Open multiple URLs (Chrome extension)
  • Session Buddy
  • Copy all URLs
  • Sublime Text or Notepad++ to generate tag links

Tag links follow this structure: http://quotes.toscrape.com/tag/<tag>/

To visually check selected tags (e.g., change, deep-thoughts, thinking, world):

  1. Paste them into Sublime Text or Notepad++
  2. Copy the base URL: http://quotes.toscrape.com/tag/
  3. Use block selection to prepend the base URL to each line

Result:

  • http://quotes.toscrape.com/tag/change/
  • http://quotes.toscrape.com/tag/deep-thoughts/
  • http://quotes.toscrape.com/tag/thinking/
  • http://quotes.toscrape.com/tag/world/
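If block selection is not available in your editor, the same list can be generated with a short script. This is a minimal sketch that only builds the links from the base URL shown above; `tag_links` is an illustrative name.

```python
BASE_URL = "http://quotes.toscrape.com/tag/"


def tag_links(tags):
    """Prepend the base URL to each tag and append the trailing slash."""
    return [f"{BASE_URL}{tag.strip()}/" for tag in tags if tag.strip()]


for url in tag_links(["change", "deep-thoughts", "thinking", "world"]):
    print(url)
```

The printed lines can be pasted directly into a tool that opens several URLs at once.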

Open all simultaneously using the “Open Multiple URLs” extension. After manual review, keep problematic tabs open and use “Copy All URLs” and “Session Buddy” to save lists of good/bad pages.

4. Difference check

Whenever side-by-side comparison is helpful, diff tools like WinMerge can be used. For example, to verify all category links from books.toscrape.com were scraped:

Normalize both lists (website categories and scraped links) by converting to lowercase, sorting alphabetically, and extracting only the category name. Then compare them in WinMerge – identical lists confirm complete extraction.
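The normalization and comparison can also be sketched in Python, using a set difference in place of a visual diff. Two assumptions are baked in and should be checked against the real site: that category links end in `<name>_<id>/index.html`, and that multi-word category names use hyphens in the link. The sample names and links are illustrative only.

```python
import re


def normalize_names(names):
    """Lowercase the displayed category names and hyphenate spaces (assumed convention)."""
    return sorted(name.strip().lower().replace(" ", "-") for name in names if name.strip())


def normalize_links(links):
    """Extract the category name from links assumed to end in '<name>_<id>/index.html'."""
    slugs = []
    for link in links:
        match = re.search(r"/([^/]+?)_\d+/index\.html$", link)
        if match:
            slugs.append(match.group(1).lower())
    return sorted(slugs)


def compare(site_names, scraped_links):
    """Return (missing from the scrape, unexpected in the scrape) as sorted lists."""
    expected = set(normalize_names(site_names))
    found = set(normalize_links(scraped_links))
    return sorted(expected - found), sorted(found - expected)


# Hypothetical sample: three categories on the site, two links scraped.
site = ["Travel", "Mystery", "Historical Fiction"]
links = [
    "catalogue/category/books/travel_2/index.html",
    "catalogue/category/books/mystery_3/index.html",
]
missing, unexpected = compare(site, links)
print("missing:", missing)
print("unexpected:", unexpected)
```

Two empty lists play the role of identical panes in WinMerge; anything else names the categories to investigate.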

5. Structured data testing tool

Some scraping methods rely on internal structures like microdata or JSON-LD. To validate these, use Google’s Structured Data Testing Tool and compare its output with your scraped data.
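For a quick local check alongside the online tool, JSON-LD blocks can be pulled out of a page with Python's standard library and compared with the scraped fields. The sketch below is a stand-in, not a replacement for the testing tool: it only handles JSON-LD (not microdata), and the HTML snippet, field names and values are invented for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical page source and scraped record.
html = """<html><head><script type="application/ld+json">
{"@type": "Product", "name": "Example book", "offers": {"price": "51.77"}}
</script></head><body></body></html>"""
scraped = {"name": "Example book", "price": "51.77"}

parser = JsonLdParser()
parser.feed(html)
item = parser.items[0]
print(item["name"] == scraped["name"], item["offers"]["price"] == scraped["price"])
```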

6. SQL

SQL is powerful for spotting anomalies in larger datasets. For example, with scraped book data, you can:

  • Check min/max prices for invalid values
  • Aggregate fields to detect unexpected patterns (e.g., a constant “_type” value indicating a scraping mistake)
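Both checks can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table name, columns and rows below are hypothetical; in practice you would load your own scraped items instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL, _type TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Book A", 51.77, "BookItem"),
        ("Book B", 0.0, "BookItem"),   # a zero price is worth a closer look
        ("Book C", 22.65, "BookItem"),
    ],
)

# Check 1: the price range exposes invalid values such as zero or negative prices.
min_price, max_price = conn.execute(
    "SELECT MIN(price), MAX(price) FROM books"
).fetchone()

# Check 2: aggregating a field shows how its values are distributed.
type_counts = conn.execute(
    "SELECT _type, COUNT(*) FROM books GROUP BY _type ORDER BY COUNT(*) DESC"
).fetchall()

print(min_price, max_price)
print(type_counts)
conn.close()
```

Whether a single value for a field is a mistake depends on the data: it is expected for a homogeneous item type, but suspicious for a field that should vary between records.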

Conclusions

In this post, we showcased several semi-automated techniques which, combined with the approaches shown in the previous posts of the series, will hopefully give you ideas for testing your data more thoroughly as part of your data quality assurance process.

Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls

Learn more about Enterprise Web Scraping

Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries/frameworks and data extraction tools that make it very easy to scrape data from a website. But in reality, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.

© Zyte Group Limited 2026