A practical guide to web data QA part IV: Complementing semi-automated techniques

If you haven’t read the previous ones, here’s the first part, the second and third part of the series.

In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.

1. Text editor search-and-replace techniques

Let’s say that the data we work with is separated by comma and line breaks:

change,deep-thoughts,thinking,world
abilities,choices

However, there isn’t a consistency of how many words are separated by comma per line. To make it easier, we can transform the data set so there’s only one word in each:

change
deep-thoughts
thinking
world
abilities
choices

In order to make this transformation of the data we can use search and replace functionalities of code text editors such as SublimeText, Notepad++ or Visual Studio Code.

This is how you can do it using Sublime Text:

Open the data file or paste it into a new tab in the program
Open Search and Replace dialog by using the shortcut CTRL + H
1. Enable Regex through the button “.*” in the bottom left corner
2. Search for line breaks using the control character \n
3. Replace with a comma (“,”)
Click on “Replace All”

Once the process finishes, all the words will be in a single row, separated by commas.

Finally, we replace commas with \n (newline).

Once the replacing is done, we have a normalized dataset with only one word per line.

2. Approaches with spreadsheets

Let's work again with data from http://quotes.toscrape.com/. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data.

After scraping the data from the page and loading it into a spreadsheet, we will be using Google Sheets for this example.

The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:

Select the tags column
Click on the Data menu option
Then Split text to columns
For the Separator window: choose “Comma”

The next step will be to convert the multiple tag columns into a single one. This can be done by defining a new range:

Select the range with tags (e.g., D2:K101)
Click on the Data menu option → Named ranges
Enter a name, e.g., “Tags”

Then apply the following formula and drag it down to the expected total length (e.g., 800 rows):

=INDEX(Tags,1+INT((ROW(A1)-1)/COLUMNS(Tags)),MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1)

Next, create a new column next to the tags populated with 1 (this will be used for counting).

The final step is to create a Pivot table:

Select the entire range (tags + count column)
Data → Pivot Table → Create
Rows: add the Tags column
Values: add the Count column (summarized by SUM)

Sort the pivot table by count descending to verify the top tags match the website.

3. Manually checking multiple pages

The top tags were verified in the previous example. What if we need to open the URLs of the top 10 tags and visually check them? In this case, we can use a combination of tools such as:

Open multiple URLs (Chrome extension)
Session Buddy
Copy all URLs
Sublime Text or Notepad++ to generate tag links

Tag links follow this structure: http://quotes.toscrape.com/tag/<tag>/

To visually check selected tags (e.g., change, deep-thoughts, thinking, world):

Paste them into Sublime Text or Notepad++
Copy the base URL: http://quotes.toscrape.com/tag/
Use block selection to prepend the base URL to each line

Result:

http://quotes.toscrape.com/tag/change/
http://quotes.toscrape.com/tag/deep-thoughts/
http://quotes.toscrape.com/tag/thinking/
http://quotes.toscrape.com/tag/world/

Open all simultaneously using the “Open Multiple URLs” extension. After manual review, keep problematic tabs open and use “Copy All URLs” and “Session Buddy” to save lists of good/bad pages.

4. Difference check

Whenever side-by-side comparison is helpful, diff tools like WinMerge can be used. For example, to verify all category links from books.toscrape.com were scraped:

Normalize both lists (website categories and scraped links) by converting to lowercase, sorting alphabetically, and extracting only the category name. Then compare them in WinMerge – identical lists confirm complete extraction.

5. Structured data testing tool

Some scraping methods rely on internal structures like microdata or JSON-LD. To validate these, use Google’s Structured Data Testing Tool and compare its output with your scraped data.

6. SQL

SQL is powerful for spotting anomalies in larger datasets. For example, with scraped book data, you can:

Check min/max prices for invalid values
Aggregate fields to detect unexpected patterns (e.g., a constant “_type” value indicating a scraping mistake)

Conclusions

In this post, we showcased multiple semi-automated techniques which, combined with the approaches shown in the previous posts of the series, will hopefully help bring creative ideas into your data quality assurance process to test your data better.

Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls

Learn more about Enterprise Web Scraping

Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries/frameworks and data extraction tools that make it very easy to scrape data from a website. But in reality, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.