Inside Zyte's System Design Process: How We Build Scalable, Reliable Solutions

During interviews, some candidates ask what their developer experience would be like if they joined Zyte. I’m Alexander, and I work on systems architecture at Zyte. In this article, I’m going to explain how we do system design for our products.

Creating an Effective PRD

The first step is making a PRD, a product requirements document in which product owners specify the feature, how it should work, how customers will interact with it, and so on. The most important things to understand at this point are the changes in user experience to be delivered and the cost, which dictates how much we can spend on developing and maintaining the feature. The audience of the document is the product team.

Here’s an example of the PRD:

The platform team is responsible for developing a sign-up flow and Zyte API dashboard, where the API key is created and needs to be passed to the ZAPI backend to inform the system that there is a new user, to grant access, set up rate limits, and apply specific organization discounts. The development time should not exceed one week per person, and maintenance costs should be negligible, compared to the main workflow. In other words, there is no dedicated budget for the maintenance of this functionality.

Technical requirements for PRD

The next step is writing the technical requirements, a document containing functional and non-functional requirements that formally explains the functionality to developers. Usually, we use a template to make things easier for writers. When filling out the template fields, one has to decide what availability, scalability, failover, and other constraints the system will need to meet.

Here’s an example of the technical requirements for the above PRD:

Functional requirements

The provisioning event for the user is generated in the dash web worker, as a result of processing the response from the payments gateway. The structure to be passed will contain a flat JSON with a 20-character API key hex string, three integer rate limits (1, 5, 15 minutes) and a float representing the organization discount applied for this account. The architecture of the web worker is made in a way that it will be difficult to arrange retries.
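The payload described above could be modeled roughly as follows. This is a minimal sketch, not Zyte's actual schema: the class name and the field names (`api_key`, `rate_limit_1m`, `rate_limit_5m`, `rate_limit_15m`, `discount`) are assumptions made for illustration; only the types and the 20-character hex key format come from the requirement.

```python
import json
import re
from dataclasses import dataclass, asdict

# Hypothetical field names; the requirement only fixes the types and key format.
_HEX_KEY = re.compile(r"[0-9a-fA-F]{20}")


@dataclass(frozen=True)
class ProvisioningEvent:
    api_key: str         # 20-character hex string
    rate_limit_1m: int   # limit for the 1-minute window
    rate_limit_5m: int   # limit for the 5-minute window
    rate_limit_15m: int  # limit for the 15-minute window
    discount: float      # organization discount applied to the account

    def __post_init__(self):
        if not isinstance(self.api_key, str) or not _HEX_KEY.fullmatch(self.api_key):
            raise ValueError("api_key must be a 20-character hex string")
        for name in ("rate_limit_1m", "rate_limit_5m", "rate_limit_15m"):
            value = getattr(self, name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative integer")
        if isinstance(self.discount, bool) or not isinstance(self.discount, (int, float)):
            raise ValueError("discount must be a number")

    def to_json(self) -> str:
        """Serialize to the flat JSON object described in the requirement."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ProvisioningEvent":
        """Parse a flat JSON object, ignoring fields this version does not know."""
        data = json.loads(raw)
        return cls(**{name: data[name] for name in cls.__dataclass_fields__})
```

Unknown fields are ignored on parsing, which is one way to leave room for the payload to grow without breaking existing consumers.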

The system should be able to perform up to 10K API key checks per second, for 1K users.

Non-functional requirements:

The provisioning process should not exceed seconds. The process should be as reliable as possible because the loss of provisioning events results in a very poor user experience, is very hard for support to troubleshoot, and quite likely ends up in developers' sprints. In the case of failures, the functionality is expected to recover by itself. Data loss is unacceptable. The functionality could be extended in the future by adding more fields.

The following service level indicators must be introduced and monitored:

Number of provisioning events generated in the dashboard/sign-up system
Number of provisioning events accepted by ZAPI
Time required for a generated provisioning event to be accepted by the ZAPI
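The three indicators above can be sketched as two counters and a list of latencies. This is an illustrative, single-process stand-in for a real metrics backend: the class and method names are hypothetical, and in a real deployment the generation timestamp would have to travel with the event, since the dashboard and ZAPI run as separate systems.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProvisioningSLIs:
    """In-memory stand-in for a metrics backend (an assumption of this sketch)."""
    generated: int = 0                              # events produced by the dashboard
    accepted: int = 0                               # events accepted by ZAPI
    latencies: list = field(default_factory=list)   # seconds, one per accepted event
    _pending: dict = field(default_factory=dict)    # event id -> generation time

    def on_generated(self, event_id, now=None):
        self.generated += 1
        self._pending[event_id] = time.monotonic() if now is None else now

    def on_accepted(self, event_id, now=None):
        started = self._pending.pop(event_id, None)
        if started is None:
            return  # unknown or duplicate delivery: do not count it twice
        now = time.monotonic() if now is None else now
        self.accepted += 1
        self.latencies.append(now - started)

    def in_flight(self):
        """Events generated but not yet accepted."""
        return self.generated - self.accepted
```

A value of `in_flight()` that keeps growing would suggest that provisioning events are being lost or delayed between the two systems.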

The fundamental difference between the two documents is the intended audience, and as a result, the level of detail and concepts used to describe the feature being designed.

Finally, when technical requirements are ready, we ensure that everyone involved in the design process understands them the same way. After we are done with the requirements, we start collecting possible solutions.

Any developer may come up with a half-page proposal explaining the core idea, and we will add it to a document outlining all the options the team has developed.

Here are example half-pager ideas for the above technical requirements:

1. Use a Kafka topic to transfer the provisioning message. The web worker will produce the provisioning message to the topic, and the Zyte API Server will consume it.

Pros: Very low latency, low computational overhead.
Cons: A need for a healthy Apache Kafka instance and the cost associated with running it.

2. Generate a list of provisioned accounts from an async job in the web worker and upload it to object storage like Amazon S3 or Google GCS, and signal to ZAPI API Server by means of Pub/Sub to download and update.

Pros: Transparency, easy to troubleshoot.
Cons: The lock-in on Google/Amazon’s Pub/Sub service for notification and object storage.

3. Periodically request the full user list from Zyte API Server and update.

Pros: Easy integration, controlled frequency and timing of the updates.
Cons: Limited latency reduction options, scalability challenges, network/CPU overhead.

4. A microservice requests full user lists synchronously from web workers and caches them, and the API Server queries it on a per-request basis. Suboptions include implementing using various languages and frameworks.

Pros: Same as 3, but the solution is optimized for API key lookups, therefore fewer of option 3’s cons.
Cons: The need for maintaining and developing a separate component.

5. Use a local MySQL replica of the users table from the web worker, and have the API Server query it directly in read-only mode.

Pros: (Kind of) easy to integrate.
Cons: MySQL replicas would have to be tuned to handle the load, replication needs to be monitored and synchronized in the case of failures.

6. Use Change-Data-Capture to populate the topic with the changes in the web worker users table, using Debezium.

Pros: No need to do anything on the web worker side.
Cons: Maintenance of Debezium and Kafka, the generation and handling of the event is non-transparent.

Once we have a document with possible solutions, we start to compare them. There is no single way of doing it, and sometimes it becomes frustrating, like comparing apples to parrots, but here are a few tips on how to make sense of it:

Discard minor details, and concentrate on the critical aspects.
Decide which critical aspects are more important than others (development time vs. cost for example).
Collapse similar solutions and make variations.
Summarize the options and their critical aspects in a single table, with as few words as possible, so the table fits a screen or a sheet of paper.

For the product, the critical aspects were the latency, development time, and reliability of the solution.
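One way to make the relative importance of the critical aspects explicit is a weighted score. The sketch below is only an illustration of that idea: the weights, the 1 to 3 rating scale, and the option names are hypothetical and do not reflect the ratings the team actually used.

```python
# Hypothetical weights for the three critical aspects; they sum to 1.
WEIGHTS = {"latency": 0.5, "development_time": 0.3, "reliability": 0.2}


def rank_options(ratings, weights=WEIGHTS):
    """Return (option, weighted score) pairs, best first.

    `ratings` maps an option name to its per-aspect score,
    where 1 is bad, 2 is average, and 3 is good.
    """
    scored = {
        option: sum(weights[aspect] * score for aspect, score in aspects.items())
        for option, aspects in ratings.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Made-up ratings for two placeholder options.
ratings = {
    "option A": {"latency": 3, "development_time": 2, "reliability": 2},
    "option B": {"latency": 1, "development_time": 3, "reliability": 3},
}

for name, score in rank_options(ratings):
    print(f"{name}: {score:.2f}")
```

A score like this does not replace the discussion, but it forces the team to agree on which aspects matter most before arguing about individual options.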

For example, the summary table for the above solutions could look like this:

Kafka-based

Solution	Reaction time	Dev. comfort	Maint. cost	Impl. cost
1. Kafka topic to transfer the provisioning message	Good	Low (because of Kafka)	Low	Average (library + fixing issues on the web worker side)
2. Use Change-Data-Capture to populate the topic with the changes in the web worker users table, using Debezium	Good	Low (the CDC will produce noise, schema migration issues, Deb. is hard to monitor)	Low	High (Debezium setup, testing various scenarios, learning Debezium format)

Synchronous

Solution	Reaction time	Dev. comfort	Maint. cost	Impl. cost
1. Periodically request the full user list from Zyte API Server and update.	Bad	Good (if we discard the scaling issues)	High (network traffic)	High (the system would have to be rebuilt on the web worker side to support the new requirements)
2. Use a local MySQL replica of the user table from the web worker, and have the API Server query it directly in read-only mode	Bad	Average (replication over public network)	High (local replica maintenance)	High (deploying new HA component and client to access it)

Mixed

Solution	Reaction time	Dev. comfort	Maint. cost	Impl. cost
1. Microservice requests the full user list synchronously from the web worker and caches it, and the API Server queries the microservice on a per-request basis.	Bad	Depends on impl. details	Average (microservice presence)	Depends on impl. details, but the microservice would have to be developed.

Other

Solution	Reaction time	Dev. comfort	Maint. cost	Impl. cost
1. Generate a list of provisioned accounts from an async job in the web worker and upload it to object storage like Amazon S3 or Google GCS, and signal to ZAPI API Server by means of Pub/Sub to download and update.	Good (if using Pub/Sub, Bad otherwise)	High (if we exclude the Pub/Sub)	Around $370	Average (GCS and Pub/Sub have Java libraries)

Finally, we selected option one, the Kafka topic, because we already had Kafka provisioned at Zyte, and its latency and development time were low.
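The flow of the Kafka topic option can be sketched as follows. To keep the example runnable on its own, an in-memory `queue.Queue` stands in for the Kafka topic; a real implementation would use a Kafka client library instead. The function names and the event field names are hypothetical. Because Kafka consumers commonly see at-least-once delivery, the consumer side is written as an upsert, so a redelivered event leaves the state unchanged.

```python
import json
import queue

# Stand-in for the Kafka topic (an assumption of this sketch, not a real client).
topic = queue.Queue()


def produce(topic, event: dict) -> None:
    """Web worker side: serialize the event and publish it, keyed by API key."""
    topic.put((event["api_key"], json.dumps(event).encode("utf-8")))


def consume_all(topic, accounts: dict) -> int:
    """API server side: apply every pending event; return how many were read."""
    applied = 0
    while True:
        try:
            key, payload = topic.get_nowait()
        except queue.Empty:
            return applied
        event = json.loads(payload)
        # Upsert by API key: applying the same event twice is harmless.
        accounts[key] = {
            "rate_limits": (
                event["rate_limit_1m"],
                event["rate_limit_5m"],
                event["rate_limit_15m"],
            ),
            "discount": event["discount"],
        }
        applied += 1
```

Keying the message by API key also means that, with a real partitioned topic, all updates for one account would land in the same partition and keep their order.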

Conclusion

The above process is an example of the design process developed and used at Zyte for delivering new functionality and maintaining the system. This design process went through several evolutionary steps before reaching its current form. For example, initially, there were no technical requirement documents, so a PRD was sent straight to the team for design.

It turned out that the requirements were understood differently by various team members, and as a result, it took more time to discuss solutions and even led to situations where there was no agreement. To fix this, a new stage of writing technical requirements using a template was introduced. We also run weekly, on-demand open design sessions, where we provide quick feedback on the artifacts, and there is an internal knowledge base where one can learn from examples of various artifacts when doing design work.

