
Beyond text: Unlocking value on the multimedia web

Read time: 10 min
Posted on March 4, 2026
The web is about more than the written word. Why companies are racing to harness the power of video, audio and pictures.

The web may have started with text (hypertext, in fact), but today the global network is richer and more dynamic than pages of words. Images, video, and audio have brought the internet to life.


Video now accounts for over 82% of all internet traffic. As consumers’ bandwidth has grown, so has their appetite for moving images and music. Multimedia content is humanity’s new mirror unto itself.


In business, however, most organizations trying to understand the world through web data are still reading text. By ignoring some of the richest, most valuable sources of information available, they may be leaving new insights on the table.

The post-text opportunity

Text data powered the first wave of the machine learning revolution. It gave us search engines, recommendation systems, and Large Language Models.


However, as text data becomes more widely mined, organizations looking for the next source of advantage are expanding their scope. Differentiation increasingly comes from signals beyond text.


Those signals live in data that shows what things look like, how they change over time, and the context surrounding them: things that text can only describe, never truly capture.


At Zyte, we’ve seen this shift firsthand. Zyte API bandwidth consumed by images, video, and audio grew 200-fold through 2025.


Many of our customers are building the future on this data.


  • They’re training generative models that need diverse, clean, and legally compliant multimedia datasets.

  • They’re grounding their language models in the physical world, using the eyes that cameras bring to teach them what a thing is, not just how it’s described.

  • Analytics teams are spotting trends and risks such as counterfeit products in marketplace images.

  • E-commerce teams analyze product images and user-generated visuals from the public web to support use cases like visual search, duplicate detection, and more accurate product matching.

  • Financial intelligence providers analyze non-text web assets such as earnings call recordings, U.S. Securities and Exchange Commission (SEC) filings, and Environmental, Social, and Governance (ESG) disclosure reports.

Harnessing multimedia data

Multimedia data is the new frontier of web data. But getting there isn’t as simple as just pointing your scraper at a JPEG instead of a <p> tag.


For anyone who knows the 20-year history of web scraping, it goes without saying that scraping video, audio, and images is a whole new ball game.


Size matters


First, there’s the sheer size of it all. Multimedia files are, of course, larger than HTML pages.


At production scale, downloads become orders of magnitude heavier, putting immense strain on the infrastructure needed for processing, storage, and transfer.


“Storage, bandwidth, and compute costs compound at scale,” says Martin Olveyra, a senior web scraping engineer at Zyte.


But in his experience, simply throwing more in-memory storage at the problem, while often faster than disk storage, is not enough. After testing a variety of both storage and network configurations, he found both avenues still require careful tuning to avoid bottlenecks.


“It’s not enough to have the raw power; we have to be smart about how we use it,” Olveyra adds.


“Due to the sheer size of files generated in these projects, every bit of optimization we can do to avoid spending bandwidth and compute on data we won’t deliver is worth the effort.”


Getting the right multimedia


Optimization is not just about asset size; it’s about getting the right data.


“We see a clear trend across customers that quantity is not the most important thing anymore,” notes Ana Lucia Martins, a project manager at Zyte Data.


Given the increased file size and infrastructure demands, when it comes to multimedia, delivering superfluous data can really hurt a project.


Conventional text scraping has developed tools and techniques to assess the validity of incoming content. But multimedia content is structured and delivered differently, so those methods don’t transfer directly.


Zyte’s team engineered a multi-pass deduplication process to ensure it only spends resources on unique and new multimedia assets.
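Zyte hasn’t published the details of its deduplication pipeline, but the multi-pass idea can be illustrated with a minimal sketch: a cheap first-pass signature (file size plus a few boundary bytes) buckets candidates, and the expensive full hash is computed only when two assets collide in the same bucket. The function names here are illustrative, not Zyte’s.

```python
import hashlib

def cheap_signature(data: bytes) -> tuple:
    """Pass 1: a cheap signature (size + boundary bytes) to bucket candidates."""
    return (len(data), data[:64], data[-64:])

def full_hash(data: bytes) -> str:
    """Pass 2: full SHA-256, computed only when a signature collides."""
    return hashlib.sha256(data).hexdigest()

def deduplicate(assets: list[bytes]) -> list[bytes]:
    """Return assets with exact duplicates removed, hashing as little as possible."""
    buckets: dict[tuple, object] = {}  # sig -> deferred bytes, or set of full hashes
    unique = []
    for data in assets:
        sig = cheap_signature(data)
        if sig not in buckets:
            buckets[sig] = data          # lone member: defer the expensive hash
            unique.append(data)
            continue
        entry = buckets[sig]
        if isinstance(entry, bytes):     # first collision: hash the deferred member
            entry = {full_hash(entry)}
            buckets[sig] = entry
        h = full_hash(data)
        if h not in entry:               # same signature, different content: keep it
            entry.add(h)
            unique.append(data)
    return unique
```

In a real pipeline the cheap signature could come from HTTP headers (Content-Length, ETag) before any bytes are downloaded, which is where the bandwidth savings actually accrue.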


“The different kinds of post-processing were very challenging but fun,” Martins added. “We implemented computer vision techniques for tasks like video resolution detection and watermark identification, while keeping computational costs tightly controlled.


“Because you do need to download and process media files before knowing whether they will be delivered, optimization is about making unavoidable work affordable. We avoided full video processing by downloading short segments and sampling frames – but even that involves heavy computation when repeated millions of times in constrained environments.”
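One common way to download short segments of a large file, rather than the whole thing, is HTTP range requests: compute a few evenly spaced byte ranges and fetch only those (the server must support ranges and answer 206 Partial Content). This is a generic stdlib-only sketch, not Zyte’s implementation; `fetch_segment` and its URL are hypothetical.

```python
import urllib.request

def sample_ranges(total_size: int, n_segments: int, segment_bytes: int) -> list[str]:
    """Evenly spaced HTTP Range header values for sampling a large file."""
    stride = total_size // n_segments
    ranges = []
    for i in range(n_segments):
        start = i * stride
        end = min(start + segment_bytes - 1, total_size - 1)
        ranges.append(f"bytes={start}-{end}")
    return ranges

def fetch_segment(url: str, byte_range: str) -> bytes:
    """Fetch one segment of a remote file using a Range request."""
    req = urllib.request.Request(url, headers={"Range": byte_range})
    with urllib.request.urlopen(req) as resp:   # expect HTTP 206 Partial Content
        return resp.read()
```

Sampling four 100 KB segments of a 1 GB video this way costs ~400 KB of bandwidth instead of a gigabyte, which is the whole point when the check is repeated millions of times.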


So, gathering the right multimedia involves building systems that can retrieve and evaluate assets on the fly, filtering out irrelevant or low-quality content before it is ever delivered. The goal is to make the work of processing these massive files as affordable as possible, ensuring that every CPU cycle and every byte of bandwidth is spent on data that matters.


Caring for media servers


“Polite” scraping is a baseline requirement for any web data collection, whatever the modality.


But multimedia assets change what “politeness” means in practice. After all, for website owners serving large video files, the cost of excess traffic could be punitive.


  • In scraping web pages, a polite access policy involves controlling request rates to avoid overloading websites.

  • With multimedia, however, it means managing bandwidth consumption, connection duration, and request scheduling. These are large, long-lived transfers that have to be sustained over time without overwhelming the source platform.
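Managing bandwidth for long-lived transfers is commonly done with a token bucket, where each token is one byte of download budget. The sketch below is a generic illustration, not Zyte’s code; the injectable clock is an assumption made so the limiter can be tested deterministically.

```python
import time

class BandwidthThrottle:
    """Token-bucket limiter: each token is one byte of download budget."""

    def __init__(self, bytes_per_sec: float, clock=time.monotonic):
        self.rate = bytes_per_sec
        self.clock = clock
        self.tokens = bytes_per_sec   # start with one second of budget
        self.last = clock()

    def wait_time(self, nbytes: int) -> float:
        """Seconds the caller should sleep before transferring nbytes more."""
        now = self.clock()
        # Replenish tokens for elapsed time, capped at one second of budget.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes         # may go negative: that is the deficit
        return max(0.0, -self.tokens / self.rate)

# A polite download loop would then read a chunk, ask the throttle how long to
# sleep, and time.sleep() that amount before reading the next chunk.
```

The same structure extends to request scheduling: replace bytes with requests per token and the bucket caps request rate instead of bandwidth.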


Websites also deploy their own mechanisms to ensure excess traffic does not degrade their service.


“It’s almost like each site has its own personality and mood,” says Diogo Suguimoto, a web scraping engineer at Zyte.


“It’s a team effort to familiarize ourselves with each site’s quirks and keep up with all the changes to any mechanisms it deploys.”


Zyte’s Ana Lucia Martins adds: “It’s counterintuitive, really, but scraping politely helps us go far rather than fast, as it reduces retries and failures over time. We are in this for the long game.”


Transforming risk management into enabling constraints


Legal and policy considerations can be key when it comes to multimedia data, especially in the age of AI.


Photographs and videos can contain personally identifiable information (PII) in the form of faces or embedded text. Videos can be subject to copyright. 


A robust multimedia data pipeline must account for these factors from the very beginning.


Zyte Data, Zyte’s done-for-you data-gathering service, has a strict compliance process to vet data sources, formulate scraping policies for each, and embed those policies into a data-gathering pipeline. The aim is to ensure customers receive datasets they can use for AI development, without inheriting hidden legal or model risk.

Multimedia, the untapped frontier

The value of public web data isn’t gone; it’s just moving.


It’s shifting from the flat, black-and-white world of text to the rich, colorful, and dynamic world of multimedia.


That new paradigm poses challenges as well as opportunities. But the tools and the teams have evolved to meet the multimedia moment, ensuring web scraping remains the path to unlock this value.
