
A practical guide to web data QA part IV: Complementing semi-automated techniques

Read Time: 7 Mins
Posted on: September 3, 2020
Category: How To
By Ivan Ivanov, Warley Lopes

If you haven’t read the previous ones, here are the first, second, and third parts of the series.

In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.

1. Text editor search-and-replace techniques

Let’s say that the data we work with is separated by commas and line breaks:

change,deep-thoughts,thinking,world
abilities,choices

However, the number of comma-separated words per line is inconsistent. To make the data easier to work with, we can transform the dataset so there is only one word per line:

change
deep-thoughts
thinking
world
abilities
choices

To make this transformation, we can use the search-and-replace functionality of code editors such as Sublime Text, Notepad++, or Visual Studio Code.

This is how you can do it using Sublime Text:

  1. Open the data file or paste it into a new tab in the editor
  2. Open the Search and Replace dialog with the shortcut Ctrl+H
    1. Enable regex mode via the “.*” button in the bottom left corner
    2. Search for line breaks using the control character \n
    3. Replace with a comma (“,”)
  3. Click on “Replace All”

Once the process finishes, all the words will be in a single row, separated by commas.

Finally, we replace commas with \n (newline).

Once the replacing is done, we have a normalized dataset with only one word per line.
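
If this normalization needs to be repeated regularly, the same result can be scripted rather than done by hand in an editor. Here is a minimal Python sketch, assuming the raw data lives in a file named tags.txt (a hypothetical name):

import re

# Read the raw data: words separated by commas and line breaks,
# with an inconsistent number of words per line.
with open("tags.txt", encoding="utf-8") as f:
    raw = f.read()

# Split on commas and newlines, collapsing the two editor
# replace passes (\n -> "," then "," -> \n) into one step.
words = [w.strip() for w in re.split(r"[,\n]+", raw) if w.strip()]

# Write the normalized dataset: one word per line.
with open("tags_normalized.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))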

2. Approaches with spreadsheets

Let's work again with data from http://quotes.toscrape.com/. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data.

After scraping the data from the page, load it into a spreadsheet; we will use Google Sheets for this example.

The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:

  • Select the tags column
  • Click on the Data menu option
  • Then Split text to columns
  • In the Separator dropdown that appears, choose “Comma”

The next step will be to convert the multiple tag columns into a single one. This can be done by defining a new range:

  • Select the range with tags (e.g., D2:K101)
  • Click on the Data menu option → Named ranges
  • Enter a name, e.g., “Tags”

Then apply the following formula in the first cell of a new column and drag it down to the expected total length (e.g., 800 rows). It walks the named “Tags” range row by row, returning one tag per output cell:

=INDEX(Tags,1+INT((ROW(A1)-1)/COLUMNS(Tags)),MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1)

Next, create a new column beside the flattened tags, populated with the value 1 (this will be used for counting).

The final step is to create a Pivot table:

  • Select the entire range (tags + count column)
  • Data → Pivot Table → Create
  • Rows: add the Tags column
  • Values: add the Count column (summarized by SUM)

Sort the pivot table by count descending to verify the top tags match the website.
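
The same top-ten check can also be reproduced outside a spreadsheet. Below is a sketch using pandas, assuming the scraped data was exported to quotes.csv with a comma-separated tags column (both names are assumptions):

import pandas as pd

# Load the scraped data; "quotes.csv" and the "tags" column
# name depend on how the export was saved.
df = pd.read_csv("quotes.csv")

# Split the comma-separated tags and explode to one tag per row,
# mirroring "Split text to columns" plus the INDEX flattening.
tags = df["tags"].str.split(",").explode().str.strip()

# Count occurrences and keep the ten most frequent tags: the
# equivalent of the pivot table sorted by SUM descending.
print(tags.value_counts().head(10))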

3. Manually checking multiple pages

The top tags were verified in the previous example. What if we need to open the URLs of the top 10 tags and visually check them? In this case, we can use a combination of tools such as:

  • Open multiple URLs (Chrome extension)
  • Session Buddy
  • Copy all URLs
  • Sublime Text or Notepad++ to generate tag links

Tag links follow this structure: http://quotes.toscrape.com/tag/<tag>/

To visually check selected tags (e.g., change, deep-thoughts, thinking, world):

  1. Paste them into Sublime Text or Notepad++
  2. Copy the base URL: http://quotes.toscrape.com/tag/
  3. Use block selection to prepend the base URL to each line

Result:

  • http://quotes.toscrape.com/tag/change/
  • http://quotes.toscrape.com/tag/deep-thoughts/
  • http://quotes.toscrape.com/tag/thinking/
  • http://quotes.toscrape.com/tag/world/

Open all simultaneously using the “Open Multiple URLs” extension. After manual review, keep problematic tabs open and use “Copy All URLs” and “Session Buddy” to save lists of good/bad pages.
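
Both the URL generation and the opening step can be scripted as well. A short Python sketch using the standard webbrowser module, with the example tags from above:

import webbrowser

BASE_URL = "http://quotes.toscrape.com/tag/"
tags = ["change", "deep-thoughts", "thinking", "world"]

# Prepend the base URL to each tag, as done with block
# selection in the editor.
urls = [f"{BASE_URL}{tag}/" for tag in tags]

# Open every URL in a new browser tab for visual review.
for url in urls:
    webbrowser.open_new_tab(url)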

4. Difference check

Whenever side-by-side comparison is helpful, diff tools like WinMerge can be used. For example, to verify all category links from books.toscrape.com were scraped:

Normalize both lists (website categories and scraped links) by converting to lowercase, sorting alphabetically, and extracting only the category name. Then compare them in WinMerge – identical lists confirm complete extraction.
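
When a diff tool is not at hand, the normalize-and-compare step fits in a few lines of Python. A sketch, assuming the two lists were saved to website_categories.txt and scraped_categories.txt (hypothetical file names):

def load_normalized(path):
    # Lowercase, strip, and sort each list, mirroring the
    # normalization done before comparing in WinMerge.
    with open(path, encoding="utf-8") as f:
        return sorted(line.strip().lower() for line in f if line.strip())

website = load_normalized("website_categories.txt")
scraped = load_normalized("scraped_categories.txt")

# Identical lists confirm complete extraction; otherwise
# report what differs on each side.
print("Missing from scrape:", set(website) - set(scraped))
print("Unexpected in scrape:", set(scraped) - set(website))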

5. Structured data testing tool

Some scraping methods rely on internal structures like microdata or JSON-LD. To validate these, use Google’s Structured Data Testing Tool and compare its output with your scraped data.
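
To inspect what a page actually embeds before comparing it against your scraped output, the JSON-LD blocks can also be pulled out directly. A sketch using requests and parsel; the URL is a placeholder for whatever page your scraper targets:

import json

import requests
from parsel import Selector

# Placeholder URL; substitute a page whose structured data
# your scraper relies on.
response = requests.get("https://example.com/some-product-page")
selector = Selector(text=response.text)

# JSON-LD lives in <script type="application/ld+json"> tags.
for script in selector.css('script[type="application/ld+json"]::text').getall():
    data = json.loads(script)
    # Compare these fields against your scraped records.
    print(json.dumps(data, indent=2))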

6. SQL

SQL is powerful for spotting anomalies in larger datasets. For example, with scraped book data, you can:

  • Check min/max prices for invalid values
  • Aggregate fields to detect unexpected patterns (e.g., a constant “_type” value indicating a scraping mistake)
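
As a concrete illustration, both checks can be written as plain SQL. A sketch using Python's built-in sqlite3, assuming the scraped books were loaded into a table named books with price and _type columns (an assumed schema):

import sqlite3

# Assumes the data was loaded into books.db beforehand; the
# table and column names are assumptions about the schema.
conn = sqlite3.connect("books.db")

# Check min/max prices for invalid values such as zero or
# negative prices.
print(conn.execute("SELECT MIN(price), MAX(price) FROM books").fetchone())

# Aggregate a field to surface unexpected patterns: every
# distinct _type value and how often it occurs.
print(conn.execute("SELECT _type, COUNT(*) FROM books GROUP BY _type").fetchall())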

Conclusions

In this post, we showcased several semi-automated techniques that, combined with the approaches from the previous posts in the series, will hopefully bring creative ideas into your data quality assurance process and help you test your data better.

Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls

Learn more about Enterprise Web Scraping

Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries, frameworks, and data extraction tools that make it very easy to scrape data from a website. In reality, though, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.
