Large Language Models (LLMs) are full of promise: instant access to a vast ocean of knowledge.
But what if you need a river, not an ocean? And what if you want the finest, freshest water? That’s when you need to go to the source.
LLMs capture knowledge only up to their training date, so they suffer from knowledge cut-offs; they are also prone to "hallucinations", lack specialized domain knowledge, and are reluctant to cite their sources.
That is changing, as products like ChatGPT Plus gain the ability to browse the web in real time. But the web is a big place - despite its seemingly infinite information, it often lacks the specialized data, those needles in the haystack, that many businesses require. And even when your LLM “browses the web”, you cannot be certain it is doing so meaningfully.
In other words, mass-market LLMs perform poorly on both recency and relevance. That’s why many businesses wanting a specialist, up-to-date knowledge engine are turning toward Retrieval Augmented Generation (RAG).
Rise of the RAG
RAG is a way to combine the strengths of two different AI approaches:
Retrieval: Instead of searching the entire internet, this specialized search system looks through a carefully curated collection of documents – your company's knowledge base, industry reports, specific websites you trust, etc. – to find the most relevant information.
Generation: This is where the LLM comes in. Instead of relying solely on its built-in knowledge, the LLM uses the retrieved information to generate a comprehensive and accurate answer.
RAG connects a pre-trained LLM to a body of preferred, up-to-date, authoritative information.
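To make that flow concrete, here is a minimal, framework-agnostic sketch of the retrieve-then-generate loop. The retrieve() and generate() functions are hypothetical stand-ins for your vector-store search and your LLM call:
# Conceptual sketch of RAG; retrieve() and generate() are hypothetical
# stand-ins you would replace with a real retriever and LLM client.

def retrieve(query: str, top_k: int = 3) -> list[str]:
    raise NotImplementedError("plug in your vector-store search here")

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def answer_with_rag(query: str) -> str:
    # 1. Retrieval: search the curated knowledge base, not the open web
    context = "\n\n".join(retrieve(query))
    # 2. Generation: hand the LLM the query plus the retrieved context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)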
Imagine you need to track the prices of specific electronic components from a select group of manufacturers. A general-purpose LLM, even one that can browse, might give you a rough average or information from outdated sources; it might hallucinate prices altogether, or prioritize consumer-facing websites over the specialized data you actually need.
With RAG, you don't leave it to chance. You build a knowledge assistant that:
Knows exactly where to look: You define the specific websites, databases, and documents that contain the authoritative information. No more wading through irrelevant search results.
Stays up-to-date: Your data sources are constantly refreshed through targeted web scraping, ensuring your LLM is always working with the most current information relevant to you.
Speaks your language: You’re building an AI that understands the nuances of your domain – the specific product codes, the industry jargon, the critical metrics that matter to your business.
Provides transparency: You know the source of every piece of information, allowing for verification and building trust in the AI's responses.
The road to RAG riches
So, how do you build a RAG-powered AI brain? It’s a four-step process:
Gathering the raw material: This is where acquiring data from the web, with high levels of control, comes in. Whether it is industry publications, online databases, competitor websites or your own existing knowledge base, you identify and obtain only the authoritative sources of information relevant to your task.
Creating a specialized memory: The extracted data is processed and stored in a "vector store." Think of this as a highly organized library, where information is split into semantic ‘chunks’ and indexed for fast, relevant retrieval.
The intelligent librarian: A "smart retriever" acts as the intermediary. When a user submits a query, the retriever searches the vector store for the most relevant information, not the whole internet.
Augmented generation: The LLM is then presented with the original query along with the retrieved context from your curated data. It’s now reasoning with the freshest, most relevant, and most trusted information.
The most common frameworks for RAG development are LlamaIndex and LangChain. But, as you can tell, it all starts with finding and gathering the right source material for your business.
Putting it into practice
That’s where web data acquisition tools come in. Unless you are building a RAG system wholly from private data, the world of web data, narrowed down to your preferred sources, will be your starting point. At Zyte, we contributed plugins for LlamaIndex, allowing the RAG framework to leverage our data acquisition capabilities. Let’s look at a LlamaIndex project.
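Before diving in, a quick note on setup. A rough sketch of the assumed prerequisites; exact package names can vary with your llama-index version:
import os

# Assumed installs (names may vary by llama-index version):
#   pip install llama-index llama-index-readers-web llama-index-readers-zyte-serp
ZYTE_API_KEY = os.environ["ZYTE_API_KEY"]  # keep the key in an env var, not in code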
1. Start with search
If you don’t yet know which pages to extract data from, you can find them with a web search. The ZyteSerpReader plugin runs a search engine query and returns the top results as a structured list of URLs.
# Import path for the Zyte SERP reader plugin; assumes the
# llama-index-readers-zyte-serp package is installed.
from llama_index.readers.zyte_serp import ZyteSerpReader

topic = "St Patricks day 2025 program in Dublin Ireland"
serp_reader = ZyteSerpReader(api_key=ZYTE_API_KEY)
search_results = serp_reader.load_data(topic)

serp_urls = []
for doc in search_results:
    url = doc.text  # each result document's text is the result URL
    metadata = doc.metadata  # additional result details, if provided
    print(f"URL : {url}")
    serp_urls.append(url)

2. Get page content
That URL list becomes the input for the next stage - obtaining the content of each page as clean, LLM-friendly text. The ZyteWebReader plugin, another wrapper around the Zyte API, can return each page’s content as a clean extracted article, as plain text stripped from the HTML ("html-text"), or as the raw HTML itself.
# ZyteWebReader ships in the llama-index-readers-web package.
from llama_index.readers.web import ZyteWebReader

web_reader_zyte = ZyteWebReader(api_key=ZYTE_API_KEY, mode="article")
documents_zyte = web_reader_zyte.load_data(serp_urls)

3. Vectorize the knowledge
Using LlamaIndex’s VectorStoreIndex class, you split your preferred content into chunks and embed each one as a vector. Hey presto, the resulting index is the basis for your new, private specialist knowledge base.
from llama_index.core import VectorStoreIndex

serp_index = VectorStoreIndex.from_documents(documents_zyte)
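Chunking is worth tuning for your domain. A hedged sketch using LlamaIndex’s global Settings (the sizes are illustrative, not recommendations); set this before calling from_documents so it applies when the index is built:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Illustrative values: smaller chunks favour precise retrieval,
# larger chunks carry more context per retrieved passage.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)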
4. Query your expert brain
Want to put your bigger brain through its paces? Use LlamaIndex’s as_query_engine() method to spin up a query engine that leans on your vector store.
query_engine = serp_index.as_query_engine()
response = query_engine.query(
    "When and what time did the Parade take place on St Patricks day in Dublin in 2025?"
)
print(response)

The response draws on the information you found and used to populate your specialist knowledge base.
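The transparency promise from earlier is also easy to check: a query response carries the retrieved chunks it was grounded in. A minimal sketch; the exact metadata keys (such as "url") depend on the reader that loaded the documents:
# Inspect the retrieved chunks behind the answer. Metadata keys
# (e.g. "url") depend on how the source documents were loaded.
for source in response.source_nodes:
    print(source.score, source.node.metadata.get("url", "<no url recorded>"))
    print(source.node.get_content()[:200])  # preview of the chunk text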
The future is RAG
Relying on generic, black-box LLMs for critical business tasks has its limitations.
While the ability of some LLMs to browse the web is a step forward, it doesn't address the fundamental need for control, precision, and domain-specific expertise. That’s what RAG provides.
Tools like Zyte’s help you find, fuse and freshen the data you need to be RAG-ready:
You define only the specific websites, databases, and documents that contain the authoritative information. No more off-topic searches for your chatbots.
Targeted web scraping keeps those sources constantly refreshed, so your LLM is always working with the most current information relevant to you.
Don't just let your LLM browse the web – empower it with the knowledge it needs to truly understand and serve your business.
