Beyond text: Unlocking value on the multimedia web

Read time: 10 min
Posted on: March 4, 2026
Use case

The web is about more than the written word. Why companies are racing to harness the power of video, audio and pictures.
By Theresia Tanzil

The web may have started with text - hypertext, in fact - but, today, the global network is richer and more dynamic than pages of words. Images, video, and audio have brought the internet to life.


Video now accounts for over 82% of all internet traffic. As consumers’ bandwidth has grown, so has their appetite for moving images and music. Multimedia content is humanity’s new mirror unto itself.


In business, however, most organizations trying to understand the world through web data are still reading only text. By ignoring some of the richest, most valuable sources of information available, they may be leaving new insights on the table.

The post-text opportunity

Text data powered the first wave of the machine learning revolution. It gave us search engines, recommendation systems, and Large Language Models.


However, as text data becomes more widely mined, organizations looking for the next source of advantage are expanding their scope. Differentiation increasingly comes from signals beyond text.


Those signals live in data that shows what things look like, how they change over time, and the context surrounding them; things that text can only describe, but never truly capture.


At Zyte, we’ve seen this shift firsthand. Zyte API bandwidth consumed by images, video, and audio grew 200-fold through 2025.


Many of our customers are building the future on this data.


  • They’re training generative models that need diverse, clean, and legally compliant multimedia datasets.

  • They’re grounding their language models in the physical world, using the eyes that cameras bring to teach them what a thing is, not just how it’s described.

  • Analytics teams are spotting trends and risks such as counterfeit products in marketplace images.

  • By analyzing product images and user-generated visuals from the public web, e-commerce teams support use cases like visual search, duplicate detection, and more accurate product matching.

  • Financial intelligence providers analyze non-text web assets such as earnings call recordings, U.S. Securities and Exchange Commission (SEC) filing documents, and Environmental, Social, and Governance (ESG) disclosure reports.

Harnessing multimedia data

Multimedia data is the new frontier of web data. But getting there isn’t as simple as just pointing your scraper at a JPEG instead of a <p> tag.


For anyone who knows the 20-year history of web scraping, it goes without saying that scraping video, audio, and images is a whole new ball game.


Size matters


First, there’s the sheer size of it all. Multimedia files are, of course, larger than HTML pages.


At production scale, downloads become orders of magnitude heavier, putting immense strain on the infrastructure needed for processing, storage, and transfer.


“Storage, bandwidth, and compute costs compound at scale,” says Martin Olveyra, a senior web scraping engineer at Zyte.


But in his experience, simply throwing more in-memory storage at the problem is not enough, even though memory is often faster than disk. After testing a variety of storage and network configurations, he found that both still require careful tuning to avoid bottlenecks.


“It’s not enough to have the raw power; we have to be smart about how we use it,” Olveyra adds.


“Due to the sheer size of files generated in these projects, every bit of optimization we can do to avoid spending bandwidth and compute on data we won’t deliver is worth the effort.”
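

One cheap way to avoid paying for data you won't deliver is to look at response headers before committing to a full download. The sketch below is an illustration rather than a description of Zyte's pipeline: the function name `should_download`, the allowed media types, and the 500 MiB ceiling are all assumptions chosen for the example.

```python
from typing import Mapping

# Hypothetical limits for illustration; real values depend on the project.
ALLOWED_PREFIXES = ("image/", "video/", "audio/")
MAX_BYTES = 500 * 1024 * 1024  # 500 MiB


def should_download(headers: Mapping[str, str],
                    allowed_prefixes: tuple[str, ...] = ALLOWED_PREFIXES,
                    max_bytes: int = MAX_BYTES) -> bool:
    """Decide from response headers alone whether a full download is worthwhile."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {k.lower(): v for k, v in headers.items()}

    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = normalized.get("content-type", "").split(";")[0].strip().lower()
    if not content_type.startswith(allowed_prefixes):
        return False

    length = normalized.get("content-length")
    if length is None:
        # Size unknown: this sketch allows the download; a stricter policy could refuse.
        return True
    try:
        return 0 < int(length) <= max_bytes
    except ValueError:
        # A malformed Content-Length is treated as a reason to skip.
        return False
```

In practice the headers might come from a HEAD request or from the start of a streamed GET. Servers do not always report `Content-Length`, so a check like this can only be a first filter.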


Getting the right multimedia


Optimization is not just about asset size; it’s about getting the right data.


“We see a clear trend across customers that quantity is not the most important thing anymore,” notes Ana Lucia Martins, a project manager at Zyte Data.


Given the increased file size and infrastructure demands, when it comes to multimedia, delivering superfluous data can really hurt a project.


Conventional text scraping has developed several tools and techniques to assess the validity of incoming content. But multimedia content is uniquely structured and delivered.


Zyte’s team engineered a multi-pass deduplication process to ensure it only spends resources on unique and new multimedia assets.
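

The article does not spell out what each pass does, so the following is only a minimal sketch of what a multi-pass approach could look like. It assumes two passes: a URL check that costs no bandwidth, then a SHA-256 check on the downloaded bytes. The class name `MediaDeduplicator` and its methods are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


class MediaDeduplicator:
    """Two cheap passes: URL first (no download), then content bytes."""

    def __init__(self) -> None:
        self._seen_urls: set[str] = set()
        self._seen_digests: set[str] = set()

    @staticmethod
    def canonical_url(url: str) -> str:
        # Lowercase scheme and host and drop the fragment. Path and query are
        # kept as-is because they may select a different asset.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, parts.query, ""))

    def is_new_url(self, url: str) -> bool:
        """Pass 1: skip URLs already seen, before spending any bandwidth."""
        key = self.canonical_url(url)
        if key in self._seen_urls:
            return False
        self._seen_urls.add(key)
        return True

    def is_new_content(self, payload: bytes) -> bool:
        """Pass 2: skip byte-identical files served from different URLs."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self._seen_digests:
            return False
        self._seen_digests.add(digest)
        return True
```

Exact hashing only catches byte-identical copies. Re-encoded or resized versions of the same asset need a similarity measure, which is a separate and more expensive pass.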


“The different kinds of post-processing were very challenging but fun,” Olveyra added. “We implemented computer vision techniques for tasks like similarity detection and watermark identification, while keeping computational costs tightly controlled.


“Because you do need to download and process media files before knowing whether it will be delivered, optimization is about making unavoidable work affordable. We avoided full video processing by downloading short segments and sampling frames – but even that involves heavy computation when repeated millions of times in constrained environments.”
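

To make the idea of short segments and sampled frames concrete, here is a rough sketch of the supporting arithmetic. It is not Zyte's implementation. The helper names are hypothetical, the similarity measure is a simple average hash standing in for whatever computer vision technique is used in production, and frames are assumed to be already decoded into small grayscale grids by a separate video library.

```python
def range_header(start: int, length: int) -> dict[str, str]:
    """Build an HTTP Range header asking for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends.
    return {"Range": f"bytes={start}-{start + length - 1}"}


def sample_indices(total_frames: int, samples: int) -> list[int]:
    """Pick up to `samples` evenly spaced frame indices."""
    if total_frames <= 0 or samples <= 0:
        return []
    samples = min(samples, total_frames)
    step = total_frames / samples
    # Take the middle of each equal-width bucket of frames.
    return [int(step * i + step / 2) for i in range(samples)]


def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of a small grayscale grid (values 0-255)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests similar frames."""
    return bin(a ^ b).count("1")
```

Whether a partial byte range can actually be decoded depends on the container format and on the server honoring range requests, so a real pipeline would need a fallback for sources where it does not work.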


So, gathering the right multimedia involves building systems that can retrieve and evaluate assets on the fly, filtering out irrelevant or low-quality content before it is ever delivered. The goal is to make processing these massive files as affordable as possible, so that every CPU cycle and every byte of bandwidth is spent on data that matters.


Caring for media servers


“Polite” scraping is a baseline requirement for any web data collection, whatever the modality.


But multimedia assets change what “politeness” means in practice. After all, for website owners serving large video files, the cost of excess traffic could be punitive.


  • In scraping web pages, a polite access policy involves controlling request rates to avoid overloading websites.

  • With multimedia, however, it means managing bandwidth consumption, connection duration, and request scheduling. These are large, long-lived transfers that have to be sustained over time without overwhelming the source platform.
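

The difference between the two kinds of politeness can be sketched with a token bucket that counts bytes instead of requests. This is an assumption about how such a limit might be built, not a description of Zyte's scheduler; the class name `BandwidthBudget` and its parameters are hypothetical.

```python
import time
from typing import Callable


class BandwidthBudget:
    """Token bucket measured in bytes rather than requests."""

    def __init__(self, bytes_per_second: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self._tokens = burst_bytes
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed, never exceeding the burst capacity.
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now

    def wait_time(self, chunk_bytes: int) -> float:
        """Seconds to wait before reading `chunk_bytes`; 0.0 means go ahead.

        When the budget covers the chunk, its bytes are deducted immediately.
        """
        if chunk_bytes > self.capacity:
            raise ValueError("chunk is larger than the burst capacity")
        self._refill()
        if self._tokens >= chunk_bytes:
            self._tokens -= chunk_bytes
            return 0.0
        return (chunk_bytes - self._tokens) / self.rate
```

A download loop would call `wait_time` before reading each chunk, sleep for the returned number of seconds if it is non-zero, and then ask again. Keeping one budget per source site caps sustained throughput to that site regardless of how many transfers are open.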


Websites themselves also employ methods to ensure excess traffic does not unduly hurt their sites.


“It’s almost like each site has its own personality and mood,” says Diogo Suguimoto, a web scraping engineer at Zyte.


“It’s a team effort to familiarize ourselves with each site’s quirks and keep up with all the changes to any mechanisms it deploys.”


Zyte’s Ana Lucia Martins adds: “It’s counterintuitive, really, but scraping politely helps us go far rather than fast, as it reduces retries and failures over time. We are in this for the long game.”


Transforming risk management into enabling constraints


Legal and policy considerations can be key when it comes to multimedia data, especially in the age of AI.


Photographs and videos can contain personally identifiable information (PII) in the form of faces or embedded text. Videos can be subject to copyright. 


A robust multimedia data pipeline must account for these factors from the very beginning.


Zyte Data, Zyte’s done-for-you data-gathering service, has a strict compliance process to vet data sources, formulate scraping policies for each, and embed those policies into a data-gathering pipeline. The aim is to ensure customers receive datasets they can use for AI development, without inheriting hidden legal or model risk.
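
As a loose illustration of what embedding policies into a pipeline can mean in code, the sketch below keeps one policy record per vetted source and denies everything else by default. The field names, domains, and values are invented for the example and do not reflect Zyte Data's actual compliance rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePolicy:
    """Hypothetical per-source rules decided during compliance review."""
    allowed_media: frozenset[str]
    requires_pii_review: bool = False
    max_bytes_per_second: int = 1_000_000


# Illustrative entries only; the domains and values are made up.
POLICIES = {
    "images.example.com": SourcePolicy(frozenset({"image"}), requires_pii_review=True),
    "media.example.org": SourcePolicy(frozenset({"image", "video"})),
}


def is_allowed(domain: str, media_kind: str) -> bool:
    """Deny by default: unvetted sources and unlisted media kinds are skipped."""
    policy = POLICIES.get(domain)
    return policy is not None and media_kind in policy.allowed_media
```

Checking the policy before a request is scheduled means a source that has not been vetted never generates traffic in the first place.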

Multimedia, the untapped frontier

The value of public web data isn’t gone; it’s just moving.


It’s shifting from the flat, black-and-white world of text to the rich, colorful, and dynamic world of multimedia.


That new paradigm poses challenges as well as opportunities. But the tools and the teams have evolved to meet the multimedia moment, ensuring web scraping remains the path to unlock this value.

© Zyte Group Limited 2026