
Production Ready RAG for Enterprise

January 2, 2023

Introduction to Large Language Models in Business

The hype and interest around Large Language Models (LLMs) are at an all-time high, with enterprises keen to harness this technology to enhance their business workflows. LLMs, known for their advanced AI capabilities, are revolutionizing various industries. However, integrating them into business operations is not without challenges.

Understanding the Gaps in Traditional LLM Adoption for Enterprises

While LLMs offer impressive capabilities, businesses often encounter hurdles in their practical application. A significant gap is the integration of internal, proprietary data into these models. This limitation hinders immediate and effective adoption in enterprise settings.

But before we go further, what is RAG?

Exploring Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) emerges as a promising solution, bridging the gap between LLMs and enterprise-specific data needs. RAG is a framework that enhances LLMs by incorporating external data sources, extending beyond the model's initial training data. This approach significantly improves the accuracy, relevance, and reliability of the generated content.

Research indicates that 36% of enterprise LLM adoption revolves around RAG. Its capability to integrate structured and unstructured data makes information retrieval more efficient, prompting tech giants like Microsoft Azure and Amazon Web Services to develop tailored RAG solutions.

Why use RAG?

Every foundational large language model (LLM) is trained on data up to a fixed point in time. If you want the LLM to generate content based on your internal data, you have two distinct ways to supplement the base model: fine-tuning, which further trains the base model on new data, or RAG, which uses prompt engineering to supplement or guide the model in real time.

Fine-tuning requires you to retrain the foundational model on your proprietary data and tweak it until it behaves the way you want. While this allows you to significantly change the output of the model, fine-tuning is a lengthy and costly process that does not guarantee a favorable outcome.

In contrast to fine-tuning, RAG offers a more streamlined approach, allowing the use of the same model as a reasoning engine over new data provided in a prompt. This technique enables in-context learning without the need for expensive fine-tuning, empowering businesses to use LLMs more efficiently.

RAG enables businesses to keep their data up-to-date and relevant. It allows for periodic updates and integrations without the need for continuous model training, ensuring that the LLM outputs remain accurate and applicable. As RAG continues to evolve, its potential in business intelligence and analytics is vast. It promises to play a pivotal role in data-driven decision-making, predictive analytics, and personalized customer experiences.
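To make the "update the data, not the model" idea concrete, here is a minimal sketch of a document store that is refreshed in place. The `DocumentIndex` class and its keyword search are hypothetical stand-ins for a real vector store; the point is only that a changed document becomes visible to retrieval immediately, with no model training involved.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentIndex:
    """Hypothetical in-memory store; a real system would use a vector database."""
    docs: dict[str, str] = field(default_factory=dict)

    def upsert(self, doc_id: str, text: str) -> None:
        # Insert a new document or overwrite the stale version.
        self.docs[doc_id] = text

    def delete(self, doc_id: str) -> None:
        self.docs.pop(doc_id, None)

    def search(self, keyword: str) -> list[str]:
        # Naive keyword match, used here only to show that updates are live.
        needle = keyword.lower()
        return [doc_id for doc_id, text in self.docs.items() if needle in text.lower()]


index = DocumentIndex()
index.upsert("pricing", "The Pro plan costs $10 per month.")
index.upsert("pricing", "The Pro plan costs $12 per month.")  # periodic refresh

print(index.search("$12"))
print(index.search("$10"))
```

After the second `upsert`, a search for the old price finds nothing, while the model itself is untouched.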

How does RAG work?

RAG is primarily divided into four stages: Data Preparation, Embeddings Generation, Data Retrieval, and Response Generation. Each step of the process influences how well the RAG system performs.

High Level RAG Architecture

Data Preparation

Initially, in the Data Preparation phase, relevant data sources are identified and curated. This stage involves gathering and organizing both structured (like databases) and unstructured data (text documents) that the model will use to generate informed responses. The quality and relevance of the data collected are paramount. The data must be thoroughly cleaned and formatted to ensure consistency and accuracy. High-quality data preparation is crucial because it forms the foundation upon which the RAG system will operate. Poorly prepared data can lead to inaccuracies in later stages, reducing the overall effectiveness of the system.
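As a rough illustration of the cleaning and formatting step, the sketch below normalizes text, strips HTML-like tags, collapses whitespace, and drops empty or duplicate documents. The function names and the sample documents are made up for this example; real pipelines usually need source-specific parsers (PDF, database exports, and so on).

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize unicode, remove HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"<[^>]+>", " ", text)  # crude tag removal, fine for a sketch
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def prepare_documents(raw_docs: dict[str, str]) -> dict[str, str]:
    """Clean every document; skip empty ones and exact duplicates."""
    seen: set[str] = set()
    prepared: dict[str, str] = {}
    for doc_id, raw in raw_docs.items():
        cleaned = clean_text(raw)
        if not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        prepared[doc_id] = cleaned
    return prepared


raw_docs = {
    "policy.html": "<p>Refunds are accepted   within 30 days.</p>",
    "policy_copy.html": "Refunds are accepted within 30 days.",
    "empty.txt": "   ",
}
prepared = prepare_documents(raw_docs)
print(prepared)
```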

Embeddings Generation

In the next stage, the prepared data is processed to create embeddings. Embeddings are high-dimensional numerical representations of the data, designed to capture the nuances and context of the information in a form the model can work with. These embeddings capture the semantic meanings and relationships within the data. The performance of the RAG system depends heavily on the quality of these embeddings. Accurate and well-structured embeddings allow the system to retrieve more relevant information in later stages, thereby improving the precision and contextuality of the final output.
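To show the mechanics of turning text into a fixed-length vector and comparing two vectors, here is a toy sketch. Note the assumption: `embed` below is a hashed bag-of-words, which only counts words and does not capture semantic meaning the way a trained embedding model does. It stands in for a real model so the example runs with the standard library alone; the dimension of 64 is also arbitrary.

```python
import math
import re
import zlib

DIM = 64  # arbitrary for this toy; trained models use far more dimensions


def embed(text: str, dim: int = DIM) -> list[float]:
    """Toy embedding: hash each word into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))


same = cosine(embed("refund policy"), embed("refund policy"))
related = cosine(embed("refund policy"), embed("refund window"))
print(f"identical text: {same:.2f}, overlapping text: {related:.2f}")
```

With a trained embedding model, texts with similar meaning but different wording would also score high, which this word-counting stand-in cannot do.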

A key aspect that impacts embeddings generation is how the processed data is “chunked” or segmented. The data needs to be divided into manageable, coherent segments or blocks before being processed to create embeddings. The size of these chunks can significantly impact the performance of the RAG system in several ways:

Smaller chunks capture finer detail and more specific information. However, chunks that are too small might miss the broader context or overlook the interconnectedness of data points. Conversely, larger chunks encompass more context but might dilute specific details. Balancing chunk size is crucial to preserve both detail and context in the embeddings.

For the same amount of data, smaller chunks lead to a larger number of embeddings, potentially increasing the computational load during retrieval. This can affect the speed and efficiency of data retrieval. Larger chunks, while reducing the number of embeddings, might result in the retrieval of less relevant information, which could hurt the performance of the RAG system.
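The chunk-size trade-off is easy to see with a small sketch. The splitter below works on whitespace-separated words; the `overlap` parameter is an addition not discussed above, included because repeating a few words across chunk boundaries is a common way to soften the loss of context. Production systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))  # a 10-word stand-in document

small = chunk_words(text, chunk_size=2)
large = chunk_words(text, chunk_size=4, overlap=1)
print(len(small), small)
print(len(large), large)
```

The same 10 words produce five chunks at size 2 and three chunks at size 4, which is the "more embeddings versus more context per embedding" trade-off in miniature.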

Data Retrieval

When a query is received, the model uses these embeddings to search for and retrieve the pieces of information in the prepared dataset that are most relevant to the query or prompt, so that the information fetched is pertinent to the user's request. The efficiency of this retrieval process significantly impacts the system's performance. Faster and more accurate retrieval shortens response time and ensures that the most relevant data is used in the response generation. It’s crucial to get retrieval right, because poor retrieval mechanisms can lead to suboptimal results, such as irrelevant or incomplete information being used in the final response.
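A minimal sketch of the retrieval step is shown below: score every chunk against the query and keep the top few. As an assumption for the sake of a runnable example, it uses simple word-count vectors and cosine similarity in place of learned embeddings, and it scans all chunks, whereas real systems typically precompute chunk vectors and use an approximate nearest-neighbor index.

```python
import heapq
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Word-count vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the `top_k` (score, chunk) pairs most similar to the query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(chunk)), chunk) for chunk in chunks]
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
results = retrieve("refunds within how many days", chunks, top_k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

Because this stand-in matches exact words only, a query that says "refund" instead of "refunds" would miss the right chunk, which is precisely the kind of gap that semantic embeddings are meant to close.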

Response Generation

Once the relevant information is retrieved, the model combines it with its pre-trained knowledge to craft a coherent and contextually accurate response. This response is not only based on the model's foundational training but is also enhanced by the specific, up-to-date data provided through the RAG process, leading to more accurate and relevant outputs. The system's ability to effectively integrate the retrieved data with its pre-trained knowledge is key to producing accurate, relevant, and detailed responses. The better the integration and synthesis of new and existing information, the more useful and insightful the final output will be to the user.
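In practice, this stage usually amounts to placing the retrieved chunks in the prompt and calling the model. The sketch below assumes the model is any callable that takes a prompt string and returns text; `fake_llm` is a hypothetical stub that stands in for a real API call so the example runs offline, and the prompt wording is just one plausible template.

```python
from typing import Callable


def generate_answer(question: str, retrieved_chunks: list[str],
                    llm: Callable[[str], str]) -> str:
    """Build a context-augmented prompt and pass it to the model."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stub model (assumption): echoes the first context passage, if any."""
    for line in prompt.splitlines():
        if line.startswith("[1] "):
            return f"Based on the context: {line[4:]}"
    return "I don't know."


answer = generate_answer(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase."],
    llm=fake_llm,
)
print(answer)
```

Passing the model in as a callable keeps the prompt-building logic testable and makes it easy to swap providers later.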

Simple RAG

Equipped with a basic understanding of how RAG works, you can now build a simple RAG system. Here’s an easy way to get started using code from LlamaIndex: Colab Notebook

Building a functional RAG system is not difficult, and there are plenty of tools out there that help you build a simple one. But you’d quickly realise that a simple RAG system frequently gets answers wrong, or may not return the information at all despite it being in the original document. It works well for simple questions on a simple document but cannot answer anything complex. Beyond the poor retrieval and hallucination issues, you’d also notice that the retrieval process can add significant latency.

Production-ready RAG

There is still a huge gap between a POC and a reliable RAG system that can be used day-to-day. RAG is simple to build but difficult to productionize due to the nature of LLMs. It takes time to run trial-and-error experiments to understand how RAG could serve your business. There are plenty of more advanced RAG techniques (which I will share in the next post) that improve performance and make a RAG system highly reliable and consistent.
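One way to make those trial-and-error experiments measurable is to keep a small set of questions with known source documents and track how often retrieval surfaces the right one. The sketch below is a hypothetical harness: the corpus, the questions, and the naive `keyword_retrieve` function are all invented for illustration, and hit rate is only one of several metrics you might track.

```python
CORPUS = {
    "refunds": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Shipping takes 3 to 5 business days.",
    "support": "Support is available by email on weekdays.",
}


def keyword_retrieve(question: str, top_k: int) -> list[str]:
    """Naive retriever: rank documents by the number of shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(q_words & set(CORPUS[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def hit_rate(eval_set, retrieve, top_k: int) -> float:
    """Fraction of questions whose expected document appears in the top_k results."""
    if not eval_set:
        return 0.0
    hits = sum(1 for question, expected in eval_set if expected in retrieve(question, top_k))
    return hits / len(eval_set)


EVAL_SET = [
    ("are refunds accepted within 30 days", "refunds"),
    ("how long does shipping take", "shipping"),
    ("can I phone the helpdesk", "support"),  # no shared words with the support doc
]

for k in (1, 3):
    print(f"top_k={k}: hit rate {hit_rate(EVAL_SET, keyword_retrieve, k):.2f}")
```

Rerunning the same evaluation set after each change (chunk size, embedding model, `top_k`) turns a vague sense of "it seems better" into a number you can compare.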

As the Generative AI ecosystem continues to evolve, the opportunity to adapt the technology to improve your business and workflow is within your grasp. If you want to experiment with advanced RAG without going through the entire development cycle, give us a try at

Be among the first to adopt the cutting-edge capability tailored to empower your business. And if you are looking for someone to help you craft your AI adoption strategy, feel free to reach out to us.
