Teaching AI to scrape like a pro: how we measure LLMs’ data quality

In the past couple of years, AI coding assistants have gone from magic power to business-as-usual. You open your code editor, type a comment, and a Large Language Model (LLM) fills in the blanks.

But there's a problem. When you ask a general AI assistant to write code, it's pulling from billions of lines of examples. Very few of them show what good code looks like.

When it comes to specialist code cases, like web scraping, that’s problematic, because scraping code - especially high-quality scraping code - is relatively under-represented in the global training data.

So, how can AI help write good scraping code? This was a key design consideration when we built Web Scraping Copilot, an AI-powered Visual Studio Code extension that specializes in generating and managing web scraping code.


We wanted to help web scraping developers not just code faster, but also to ship good scraping code, fast.

So, what does "good" scraping code actually look like, and how do you get it?

What makes scraping code ‘good’?

Before teaching Web Scraping Copilot how to generate quality code, we first had to define what constitutes quality in the specific case of web scraping.

The data industry has never really formalized such a standard, so Zyte needed to build a measurement system where none existed.

We turned to our team of hundreds of Scrapy experts, distilling its experience in creating accurate and maintainable Scrapy code into three quantifiable dimensions that describe data accuracy and code maintainability.

| Variable | Area | Measurement test |
|---|---|---|
| ROUGE-1 F1 adj | Data accuracy | The code extracts the right data with the right values. |
| Source lines of code (SLOC) | Code complexity | The code is tight and non superfluous. |
| Cyclomatic complexity | Code complexity | The logic is simple and understandable. |

If we could define what “accuracy” and “maintainability” actually mean, we could score them.

1. The accuracy challenge: Measuring messy data

Measuring the accuracy of extracted web data is trickier than it seems, because desired output is not always clear from AI prompt inputs.

So we adapted a metric from natural language processing called ROUGE-1 F1, which measures token-level overlap between texts, and extended it to handle structured web data.

This metric gives partial credit for values that differ in formatting but are semantically equivalent (like “24.99”, “$24.99”, and “24.99 USD”) – letting us score thousands of extraction attempts without penalizing harmless variations.

ROUGE-1 F1 scores range from 0 to 1, where higher values mean more accurate extraction. With this benchmark in our toolset, we could be confident that we were not skipping relevant data.
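To make the token-overlap idea concrete, here is a minimal sketch of plain ROUGE-1 F1 in Python. It is not Zyte's adjusted metric: the article does not spell out the "adj" extension for structured data, so the tokenizer and the function names below are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase, keep numbers (with decimals) and words,
    # and drop symbols such as "$".
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("$24.99", "24.99"))     # 1.0: the currency symbol is ignored
print(rouge1_f1("24.99 USD", "24.99"))  # about 0.67: one extra token
print(rouge1_f1("19.99", "24.99"))      # 0.0: the value itself is wrong
```

Under these assumptions, a formatting difference costs little or nothing, while a wrong value scores zero.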


2. The maintainability challenge: The leaner, the better

Scraping code that picks up accurate data also needs to be easily understood and adapted as websites change.

The first signal we look at for maintainability is the length of the code generated.

We record the source lines of code (SLOC) in each generated spider.

Fewer lines generally mean less surface area for bugs and lower maintenance cost over time. Keeping SLOC low encourages spiders that are focused, declarative, and easier to reason about.
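As a rough illustration, SLOC can be approximated by counting lines that are neither blank nor comment-only. The article does not say which counting rules Zyte applies, so this is a simplified sketch; the spider in the sample string is hypothetical and is only counted, never run.

```python
def count_sloc(source: str) -> int:
    """Count lines that are neither blank nor comment-only."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Hypothetical spider source, held in a string so Scrapy is not required here.
SPIDER = '''
import scrapy


class BooksSpider(scrapy.Spider):
    # Illustrative only.
    name = "books"
    start_urls = ["https://example.com/books"]

    def parse(self, response):
        for book in response.css("article.product"):
            yield {"title": book.css("h3 a::attr(title)").get()}
'''

print(count_sloc(SPIDER))  # 7
```

This simple rule counts docstring lines as code; a stricter counter would exclude them.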

3. The complexity challenge: Going deep without getting lost

Yet, code length alone doesn’t tell the full story. Two spiders with the same number of lines can vary in how easy they are to understand.

That’s where cyclomatic complexity comes in.

Cyclomatic complexity measures how many independent decision paths exist in a piece of code - essentially, how many branches, conditionals, and forks a reader has to keep in their head at once.

Lower values are generally better: they indicate linear, predictable logic that is easier to test and modify. Higher values suggest brittle code where small changes can have unintended side effects.

| Cyclomatic complexity score range | Interpretation |
|---|---|
| 1 - 10 | Simple to moderate complexity. Low risk. |
| 11 - 20 | Moderate. Careful review needed to justify. |
| 21 - 40 | Complex. Difficult to test and maintain. |
| Above 40 | Unmaintainable. |

A well-structured spider would typically land at around five to 15.
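The decision-path counting described above can be approximated with Python's standard `ast` module. This is a simplified sketch, not the tool Zyte uses (the article does not name one): it starts at 1 and adds one for each branch-like node, and the choice of node types is an assumption. The `interpret` helper simply maps a score onto the bands in the table above.

```python
import ast

# Node types that each add one decision point in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand in an and/or chain adds one path.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # The comprehension loop itself plus each "if" filter.
            complexity += 1 + len(node.ifs)
    return complexity


def interpret(score: float) -> str:
    """Map a score onto the bands from the table above."""
    if score <= 10:
        return "Simple to moderate complexity. Low risk."
    if score <= 20:
        return "Moderate. Careful review needed to justify."
    if score <= 40:
        return "Complex. Difficult to test and maintain."
    return "Unmaintainable."


# Hypothetical helper of the kind a spider might contain.
SAMPLE = '''
def parse_price(text):
    if not text:
        return None
    for symbol in ("$", "USD"):
        text = text.replace(symbol, "")
    return float(text) if text.strip() else None
'''

score = cyclomatic_complexity(SAMPLE)
print(score, interpret(score))  # 4 Simple to moderate complexity. Low risk.
```

Here the base path, one `if`, one `for`, and one conditional expression add up to 4.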

Bringing it together

Taken together, these metrics let us evaluate a scraper from multiple angles at once. Here’s what that looks like for a single spider:

| Scraper name | rouge1_f1_adj | SLOC | Complexity |
|---|---|---|---|
| Product scraper for website A | 0.7955 | 35 | 6.25 |

In this example:

A ROUGE-1 F1 adj score of ~0.8 indicates good extraction accuracy, with minor acceptable variations in formatting.
35 source lines of code suggests the scraper is compact.
A cyclomatic complexity of 6.25 means the logic is straightforward, with intuitive branching.

Together, they give us a practical, repeatable way to judge whether a scraper has good quality.

Iterating toward production quality

With our scoring system in place, we could move toward building a Visual Studio Code extension that reliably produces good scraping code.

For Web Scraping Copilot, that meant perfecting our own extension code and crafting embedded prompts that it uses to turn mass-market LLMs into expert spider generators.

We used the following process to establish target thresholds for each score:

Data accuracy: The team produced a source-of-truth dataset - a pre-assembled list of 1,250 on-page data fields, from hundreds of URLs, that are known to be correct. By comparing output from our LLM-produced spider code against the values known to be correct, we could make changes to nudge that rouge1_f1_adj score ever closer to 1.
Code complexity: Zyte specialists reviewed the SLOC and cyclomatic complexity scores for LLM-produced spiders to assess whether they met their expectations for clarity and structure.
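The data accuracy step can be sketched as a loop over a ground-truth dataset that averages a per-field score. Everything here is illustrative: the dataset layout, the function names, and the URLs are assumptions, and the default scorer is a simple normalized exact match standing in for the ROUGE-based score the article describes.

```python
from statistics import mean
from typing import Callable


def normalized_match(extracted: str, expected: str) -> float:
    # Stand-in scorer: 1.0 on a whitespace- and case-insensitive match, else 0.0.
    def norm(value: str) -> str:
        return " ".join(value.split()).lower()

    return float(norm(extracted) == norm(expected))


def score_against_truth(
    extracted: dict[str, dict[str, str]],
    truth: dict[str, dict[str, str]],
    scorer: Callable[[str, str], float] = normalized_match,
) -> float:
    """Average the per-field score over every field in the ground truth."""
    scores = []
    for url, expected_fields in truth.items():
        got = extracted.get(url, {})
        for field, expected in expected_fields.items():
            # A field the spider failed to extract scores zero.
            scores.append(scorer(got[field], expected) if field in got else 0.0)
    return mean(scores) if scores else 0.0


# Made-up records keyed by URL.
truth = {"https://example.com/p/1": {"name": "Blue Mug", "price": "24.99"}}
extracted = {"https://example.com/p/1": {"name": "blue  mug"}}

print(score_against_truth(extracted, truth))  # 0.5: name matches, price is missing
```

Passing a different `scorer` swaps in any other field-level metric without changing the loop.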

After a couple of iterations, it became clear - good LLM-generated scraping code on average has a scorecard like this:

| rouge1_f1_adj | SLOC | Complexity |
|---|---|---|
| 0.8 + | 30 to 40 | < 12 |

With these targets in place, improvements followed a reliable process: adjust prompts or tooling, re-run code generation, and check whether changes moved quality in the right direction across all metrics.
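One way to picture that check is a small scorecard comparison between two runs. This is a sketch under stated assumptions: the class and function names are hypothetical, the numbers are made up, and treating 40 as an SLOC ceiling is one reading of the "30 to 40" target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    rouge1_f1_adj: float
    sloc: float
    complexity: float


def meets_targets(card: Scorecard) -> bool:
    # Targets from the scorecard above; the SLOC ceiling of 40 is an assumption.
    return card.rouge1_f1_adj >= 0.8 and card.sloc <= 40 and card.complexity < 12


def compare(baseline: Scorecard, candidate: Scorecard) -> dict[str, str]:
    """Label each metric as improved, regressed, or unchanged."""

    def label(gain: float) -> str:
        return "improved" if gain > 0 else "regressed" if gain < 0 else "unchanged"

    return {
        # Higher accuracy is better; lower SLOC and complexity are better.
        "rouge1_f1_adj": label(candidate.rouge1_f1_adj - baseline.rouge1_f1_adj),
        "sloc": label(baseline.sloc - candidate.sloc),
        "complexity": label(baseline.complexity - candidate.complexity),
    }


baseline = Scorecard(rouge1_f1_adj=0.78, sloc=42, complexity=9.0)   # made-up numbers
candidate = Scorecard(rouge1_f1_adj=0.82, sloc=36, complexity=9.5)  # made-up numbers

print(meets_targets(candidate))
print(compare(baseline, candidate))
```

In this made-up run the candidate meets all three targets, yet complexity regresses slightly, which is the kind of mixed result that calls for a judgment on overall value.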

Sometimes, gains were obvious. Other times, they revealed trade-offs: a change might reduce the number of generation attempts needed to produce working code (good), while slightly hurting extraction accuracy (bad). In those cases, we only accepted changes when the overall outcome clearly delivered more value than it cost.

AI code quality is real, today

Today, Web Scraping Copilot consistently generates scraping code that meets the quality bar we set during development and does so in a measurable, repeatable way.

Just as importantly, these scores are not treated as a one-time gate. They are monitored continuously. Every prompt change, tooling adjustment, or model upgrade is evaluated against the same metrics to ensure quality does not regress as the system evolves. When we see improvements, we raise expectations. When trade-offs appear, we consider them holistically.

Every iteration brings Web Scraping Copilot closer to thinking less like a generic AI coding assistant and more like a colleague who has spent years writing production scrapers.

And the beauty of scoring our own product’s output in this way is that we can apply the same approach to rating the relative quality of scraping code produced by any of the LLMs usable by the extension.

[Figure: WSC Bench, 2026-02-17]

For instance, when Anthropic released Sonnet 4.6 in February 2026, Zyte’s research and development team was able to crunch the numbers to show how it beat all rival models in most of the score areas.

That is, at the time, Sonnet 4.6, when instructed by Web Scraping Copilot’s best-in-class, secret-sauce scraping know-how, produced the very best auto-generated scraping code.

We are excited to see where these scores go next, as frontier models get better and better.

Where general AI stops, Web Scraping Copilot begins

Most of today’s general-purpose AI coding assistants optimize for plausibility and speed, not for long-term accuracy or maintainability.

Zyte has “taught” the AI to code like our best scraping engineers by defining, measuring, and iteratively improving quality along the axes that matter most: accuracy and complexity.

We believe that gaining and maintaining access to web data should be hassle-free, no matter who, or what, is writing the code.
