Mastering data extraction from complex websites with web scraping APIs
Read Time: 10 mins
Posted on: September 3, 2024
By: Cleber Alexandre
Ban management is only the beginning
We all know that crawling a complex website without triggering any bans is the first big hurdle for any web data extraction project. But once you've made it in, the website's content could be static and simple or dynamic and complex.
Dynamic content has become increasingly common for various reasons, and it can be a major roadblock for a web scraping project if it's not crawled and managed correctly.
This article covers how to scrape a website that has dynamic content.
Strategies will depend on the type of website complexity
For complex websites that rely heavily on JavaScript (for example, AJAX requests) or similar client-side techniques to handle dynamic content, the typical approach to writing spiders will not suffice.
This is often true for major e-commerce websites that dynamically update their prices and product variations, and for websites that pull data from multiple sources without changing their HTML structure.
For those cases, you need specialized libraries, advanced spiders, and powerful web scraping automation tools. Let’s analyze the different types of complex and dynamic content and how web scraping APIs are fit for the job.
Search-gated content and large catalogs
When dealing with websites that require users to search their databases to access content, adopting a search-based crawling strategy is highly effective. This approach is particularly useful for search-gated content, where you only gain access to the needed information after performing a search through a search box or adding queries to the website URL.
For websites with massive catalogs, such as e-commerce sites, performing specific queries rather than scraping everything can save significant time and resources. This method allows you to target only the desired content, making the process more efficient.
The search process can be executed with or without browser rendering.
A practical solution is to use a web scraping API equipped with a built-in headless browser that supports programmable actions. You can program the browser to interact with the search box, typing in the query and navigating to the search results page, and then feed the resulting page to your spider.
Zyte API offers a headless browser with programmable actions that can be utilized directly from any spider, making it a valuable tool for handling search-based crawling.
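As a rough sketch, here is what that flow can look like against Zyte API's extract endpoint. The site URL and CSS selectors are placeholders, and the exact action schema should be checked against the Zyte API documentation:

```python
import requests

# Minimal sketch: search-based crawling with Zyte API browser actions.
# The URL, selectors, and query below are placeholders to adapt.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),  # API key as username, empty password
    json={
        "url": "https://example.com",
        "browserHtml": True,
        "actions": [
            # Type the query into the search box, then submit it
            {
                "action": "type",
                "selector": {"type": "css", "value": "input[name='q']"},
                "text": "wireless headphones",
            },
            {
                "action": "click",
                "selector": {"type": "css", "value": "button[type='submit']"},
            },
            # Wait until the results container has rendered
            {
                "action": "waitForSelector",
                "selector": {"type": "css", "value": ".search-results"},
            },
        ],
    },
)
# The rendered search results page, ready to hand to your parsing code
search_results_html = api_response.json()["browserHtml"]
```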
Continually updated websites
Major e-commerce websites are fueled by sophisticated, real-time pricing strategies, leading to constant updates not only on price, but also on product variants and other content that can give them a competitive edge.
With these dynamic content websites, you’re working against the clock: prices can differ between extractions, so you need a system that collects data only when something changes.
You’ll also need to update already-scraped products with any new variants or content you find.
Zyte API comes with the automatic extraction feature powered by AI, allowing developers to start getting product data from any e-commerce website in seconds. Any changes in the website’s schema won't affect the schema of the extracted data you get.
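A minimal sketch of what requesting automatic extraction can look like. The product URL is a placeholder, and the response fields shown are examples from the product schema, so verify them against the current Zyte API docs:

```python
import requests

# Sketch: AI-powered automatic product extraction with Zyte API.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://example.com/p/12345",  # placeholder product URL
        "product": True,  # ask the API for a structured product item
    },
)
product = api_response.json()["product"]
# Field names follow the product schema; .get() guards against absences
print(product.get("name"), product.get("price"), product.get("currency"))
```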
Continual website layout changes
Brands need to remain digitally competitive, keeping up with new consumer trends and behaviors, and they do so by updating their websites’ layouts. New branding experiences, conversion rate optimization, A/B testing, and seasonal campaigns are common reasons to trigger layout changes.
The typical approach to web scraping falls short in these cases, because spiders built strictly against a given website structure break as soon as that structure changes.
We now have new technologies taking advantage of AI, able to interpret a dynamic website page to locate and extract the necessary data, even if it has changed its original position. Zyte API, for instance, comes with an automatic extraction feature optimized for the most common data types: products, articles, SERPs and job postings.
Zyte API can also use AI to extract navigation data. You can build spiders that entirely rely on AI for both crawling and parsing, so a single input URL is enough for a spider to automatically give you, for example, all the data of products from a given category of an e-commerce website. No website-specific code necessary.
Also, you'll have access to ready-to-use AI-powered spider templates that use these Zyte API features, while making it easy to override AI results through web-poet page objects, so you are always in control.
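For illustration, a minimal web-poet page object might look like the sketch below. The domain and selector are placeholders, and in practice you would typically subclass a base page object from zyte-common-items so that only the fields you override replace the AI output:

```python
from web_poet import WebPage, field, handle_urls

# Sketch of a site-specific override, assuming a recent web-poet release.
@handle_urls("example.com")
class ExampleProductPage(WebPage):
    @field
    def name(self):
        # Replace just the "name" field with site-specific parsing
        return self.css("h1.product-title::text").get()
```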
Pagination
Getting all items from a product category or a trending topic on a news website will often require you to sort through several pages of content. You may need only the first X pages, the last ones, or content from a specific page. You can automate navigation through pagination menus on any website using custom crawling rules in your spiders, as sketched below.
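A minimal Scrapy sketch of such a rule: follow "next" links up to a fixed page budget. The start URL and selectors are placeholders:

```python
import scrapy

class CategorySpider(scrapy.Spider):
    name = "category"
    start_urls = ["https://example.com/category/widgets"]  # placeholder
    max_pages = 5  # stop after the first X pages

    def parse(self, response, page=1):
        # Yield one item per product card on the current page
        for item in response.css("div.product"):
            yield {
                "title": item.css("h2::text").get(),
                "url": item.css("a::attr(href)").get(),
            }
        # Follow the "next" link until the page budget is exhausted
        next_page = response.css("a.next::attr(href)").get()
        if next_page and page < self.max_pages:
            yield response.follow(next_page, cb_kwargs={"page": page + 1})
```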
Zyte API uses AI and ML to automatically extract web data from common data types, such as articles, products, job postings and SERPs, without needing to write or maintain parsing code for each site. This automatic extraction already considers navigation data such as category following and pagination.
Infinite scrolling
Loading items as the user reaches the end of a webpage is an alternative to pagination, and it is commonly used on JavaScript-heavy websites.
The best approach is to reverse-engineer the JavaScript code to get the infinite scrolling content, which is usually implemented through paginated API requests.
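A hedged sketch of that approach follows. The endpoint path and parameter names are hypothetical; the real ones are whatever you find in your browser's network tab while scrolling the page:

```python
import requests

# Sketch: call the paginated JSON endpoint behind infinite scrolling
# directly, instead of simulating scrolls in a browser.
items = []
page = 1
while True:
    resp = requests.get(
        "https://example.com/api/feed",          # hypothetical endpoint
        params={"page": page, "per_page": 50},   # hypothetical parameters
    )
    batch = resp.json().get("items", [])
    if not batch:
        break  # no more content to load
    items.extend(batch)
    page += 1
```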
Zyte API’s headless browser comes with standard actions that allow scrolling, like scrollBottom or scrollTo, but bear in mind that it has a limited run time of less than 1 minute, which might not be enough for some cases.
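For cases that fit within that window, a request using the scrollBottom action might look like this sketch (the URL is a placeholder):

```python
import requests

# Sketch: trigger lazy loading via Zyte API's scrollBottom action.
# Browser sessions are capped at under a minute of run time, so very
# long feeds may be better served by the direct-API approach above.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://example.com/feed",  # placeholder URL
        "browserHtml": True,
        "actions": [{"action": "scrollBottom"}],
    },
)
html_after_scroll = api_response.json()["browserHtml"]
```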
CAPTCHAs
The best way to deal with modern CAPTCHAs is to understand and adapt to the triggers that activate them, using strategies like rotating proxies, varying request frequency, and organic crawling patterns to avoid detection.
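In a plain Scrapy project, pacing and varying requests comes down to a few settings in settings.py. The values below are illustrative starting points, not recommendations:

```python
# Scrapy settings that help avoid CAPTCHA triggers by pacing requests.
DOWNLOAD_DELAY = 2                    # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the delay (0.5x to 1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-site concurrency organic
AUTOTHROTTLE_ENABLED = True           # adapt request rate to site latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Proxy rotation is handled separately, via middleware or a scraping API.
```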
These tactics are easy to set up with a web scraping API. Zyte API configures the necessary settings for you, unblocking websites without triggering CAPTCHAs.
Geo-locked content
International websites often detect a user’s location to display specific content, such as translations, local currencies, or time zone-related information. E-commerce sites may use this location data to set varying shipping prices and rules, and some websites block visitors from certain countries altogether.
When scraping websites that respond dynamically to geolocation, it’s essential to access them through localized proxies. Zyte API includes an automatic geolocation feature that adjusts its location based on the website’s requirements. Additionally, you can extend the geolocation options to more than 200 countries using Extended Geolocation.
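As a sketch, pinning a request to a specific country looks roughly like this. The geolocation field takes an ISO country code, but verify the parameter details against the current Zyte API schema:

```python
import requests

# Sketch: request a page as if browsing from a specific country.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://example.com",  # placeholder URL
        "browserHtml": True,
        "geolocation": "DE",  # serve the request from Germany
    },
)
localized_html = api_response.json()["browserHtml"]
```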
Some websites use ZIP codes or other physical address fields to display specific content. While websites cannot reliably determine your ZIP code from your IP address, using a geolocation from the same country usually grants access to the data. However, you may need to configure a ZIP code on the website manually.
In such cases, you can analyze how the website sets physical address information by examining cookies or other methods, then inject this data into your HTTP requests. Zyte API provides the setLocation action, allowing you to configure the target physical address, store the corresponding cookies, and reuse them in follow-up requests for price data.
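A heavily hedged sketch of that flow follows. The setLocation address fields and the cookie-related flags shown here are assumptions to check against the Zyte API action reference:

```python
import requests

# Sketch: configure a target physical address via the setLocation action
# and capture the resulting cookies for reuse in follow-up requests.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://example.com",  # placeholder store URL
        "browserHtml": True,
        "actions": [
            # Assumed address schema; consult the action reference
            {"action": "setLocation", "address": {"postalCode": "10001"}},
        ],
        "responseCookies": True,  # assumed flag to return location cookies
    },
)
# Replay these cookies in follow-up requests for price data
location_cookies = api_response.json().get("responseCookies", [])
```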
In other scenarios, websites might store this configuration not directly in your cookies but in a server-side session record, the ID of which is stored in your cookies. If the same session ID is used across different IP addresses or browsers, some websites may block your requests. To avoid this, you can use Zyte API server-managed sessions, ensuring your requests rotate through sessions with pre-configured ZIP codes.
Relying on web scraping APIs to handle the most demanding dynamic content
Web scraping APIs are built differently from Proxy APIs or Website “Unblocker” APIs, which are tools designed for one job: overcoming bans to access a website’s content.
While some proxy APIs can connect to headless browsers and other tools, only web scraping APIs have the built-in structure to connect your spiders to a plethora of different tools that you’ll eventually need when scraping difficult websites at scale.
The web will keep evolving, and it will demand the same of our web scraping tools. Zyte API is evolving into the most adaptable and resilient web scraping API, allowing you to work with both simple and complex websites, regardless of the size of your project.
Claim your free credits and give it a try.
FAQ
Why do typical spiders fail to scrape dynamic content?
Typical spiders are designed for static websites. They often fail on dynamic content that relies heavily on JavaScript (AJAX) or real-time updates; such sites require specialized libraries and tools.
How can I scrape websites with search-gated content?
For search-gated content, use a web scraping API with a headless browser that can perform search-based crawling. This allows you to target specific content efficiently.
How do I handle scraping for continually updated websites?
Use tools like Zyte API with automatic extraction features powered by AI to handle frequent updates and changes in website schemas without needing constant code maintenance.
What if a website frequently changes its layout?
Zyte API utilizes AI to automatically adapt to layout changes, ensuring data extraction remains accurate without needing specific code adjustments for each website.
How can I deal with pagination during web scraping?
Automate pagination navigation using custom crawling rules or rely on Zyte API's AI to handle common pagination types automatically.
What’s the best approach for websites using infinite scrolling?
Use a headless browser with programmable actions to mimic scrolling behavior or reverse-engineer the JavaScript that handles infinite scrolling.
How can I bypass CAPTCHAs while scraping?
Implement strategies like rotating proxies, adjusting request frequency, and using organic crawling patterns. Zyte API handles these settings to help avoid triggering CAPTCHAs.
How do I access geo-locked content during scraping?
Utilize localized proxies to access content restricted by geolocation. Zyte API offers automatic geolocation adjustments and extended options for over 200 countries.
Can I scrape content based on specific ZIP codes or addresses?
Yes, by analyzing and injecting necessary data into HTTP requests or configuring server-managed sessions, Zyte API enables scraping of content specific to ZIP codes or physical addresses.