Highlights
You'll learn how Yurts' Retrieval Augmented Generation (RAG) system delivers high performance at a fraction of the cost.
Read on to uncover the power and efficiency of Yurts' RAG platform for enterprise AI assistants.
At Yurts, we build helpful AI assistants for enterprises that (a) generate answers grounded in the enterprise's own massive proprietary knowledge base, (b) scale to a large number of knowledge workers in the enterprise, and (c) respect and mirror the access controls already established in the enterprise. We accomplish this with a multi-faceted approach that includes building a powerful and robust Retrieval Augmented Generation (RAG) system to support natural language queries.
RAG is a powerful technique which enables Large Language Models (LLMs) to answer questions about a user’s private knowledge base without actually having to train these models on private data. In a typical RAG system, a user's knowledge base is broken down (chunked) into smaller text pieces and each piece is embedded into a vector space using an embedding model. These embeddings capture the semantic information present in the chunks and enable a direct comparison against a natural language query (by embedding this query into the same vector space) to enable retrieving the most relevant chunks for a given query.
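The chunk, embed, and compare loop described above can be sketched in a few lines. This is a minimal illustration only, not the Yurts pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the fixed-size word-window chunker is deliberately naive, and the sample chunks are made up for the example.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_words=12):
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=5):
    """Embed the query into the same space as the chunks and return the k closest chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical knowledge-base chunks, invented for illustration.
chunks = [
    "Expense reports must be submitted within thirty days of travel.",
    "The VPN client is required for remote access to internal services.",
    "New hires complete security training during their first week.",
]
top = retrieve("How do I get remote access to internal services?", chunks, k=1)
print(top[0])
```

In a production system the count vectors would be replaced by dense vectors from an embedding model and the sort by a vector index, but the flow of embedding the query into the same space and ranking chunks by similarity is the same.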
One of the major challenges with chunking and embedding is that chunks embedded independently of each other can lack appropriate context. For example, consider a knowledge base that comprises the following text: “The Golden Gate Bridge is a historical landmark in the city of San Francisco. Its red color makes it easily recognizable from a distance”. If our chunking technique ends up separating the two sentences, it would be impossible to resolve what “Its” refers to in the second sentence without additional context. Many techniques have been developed to address these kinds of shortcomings, such as adding generic document summaries to chunks, hypothetical document embedding, and summary-based indexing.
Recently, Anthropic introduced a new technique called Contextual Retrieval to better address this issue. In Contextual Retrieval, every chunk is enhanced with a concise, chunk-specific context that explains the chunk using the overall document. This is achieved by prompting their Claude model with both the chunk and the whole document. For example, the two chunks for the previous knowledge base could look like:
In this post, we make the following contributions:
Details on the Codebases dataset: The Codebases dataset is divided into three different components:
More statistics concerning the Codebases dataset can be found in the appendix.
Evaluating Yurts: To evaluate Yurts’ RAG platform on the Codebases dataset, we ingest the 90 documents present in the dataset. The platform then uses its custom chunking and embedding technique to store the documents in a manner that enables superior retrieval performance. Following this, we issue the 248 queries present in the dataset to the platform and collect the top-5 chunks per query that the platform deems most relevant to answering those queries.
Unlike existing techniques (including Contextual Retrieval), where chunking is static and done only once before the embedding stage, we use dynamic approaches to ensure that no critical information relevant to a user's query is left out. This makes it difficult to directly compare our top-5 retrievals with the golden chunks present in the Codebases dataset as our chunks might not align with the ones present in the dataset.
To remedy this, we take a two-step approach. First, we compute an edit distance between the top-5 chunks retrieved by our platform and the corresponding golden chunks. Second, we use a human-in-the-loop evaluation to ensure that chunks retrieved by our platform that contain the part of the golden chunk relevant to answering the question are counted towards the final score. This is important because we do not want to penalize a retrieval system for retrieving chunks that do not align well with extraneous information that might be contained in the golden chunks.
Our Results: On running our two-step evaluation on the Codebases benchmark we found the following:
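A minimal sketch of how the first, automated step could work is shown below. It is an illustration under stated assumptions, not the exact procedure used in this evaluation: the normalization by the longer string, the `threshold` value of 0.8, and the idea of routing non-matching queries to human review are all hypothetical choices.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def auto_match(retrieved, golden, threshold=0.8):
    """True if any retrieved chunk is close enough to the golden chunk.

    The threshold is a hypothetical value chosen for illustration.
    """
    return any(similarity(chunk, golden) >= threshold for chunk in retrieved)

def needs_human_review(retrieved, golden, threshold=0.8):
    """Queries without an automatic match go to the human-in-the-loop step (assumed workflow)."""
    return not auto_match(retrieved, golden, threshold)
```

A retrieved chunk that contains the relevant part of a golden chunk plus surrounding text can score low on a whole-string edit distance, which is why the second, human-in-the-loop step matters.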
After aggregating the results from the above evaluations our final results are as follows:
When comparing different combinations of SOTA embedding models and retrieval algorithms on the same benchmark (Codebases), we find that Yurts RAG scores 93.2% (i.e., for 93.2% of the benchmark queries, the golden chunk is within the top-5 retrieved contexts), while the other methods evaluated range from 80.9% to 93.9%, with the best performance achieved by Contextual Retrieval. The performance difference between Contextual Retrieval and the Yurts RAG system is not statistically significant (details of this calculation can be found in the appendix).
In this section we evaluate the two techniques for their cost to the end customer. Such an analysis is crucial since cost can be a driving factor for choosing one technique over another in a business-centric application. We carry out this analysis by comparing the costs associated with model deployment, API calls, context generation, and running embedding models for the two techniques.
The primary costs associated with the Contextual Retrieval technique correspond to the number of API calls made to the:
On the other hand, document ingestion in the Yurts platform relies on very small language models optimized for on-premise deployment. Although the Yurts RAG system can be hosted on-premise, for a fair comparison we assume that the GPUs are acquired on an hourly basis from a cloud provider (for instance, AWS) and use the associated costs in our calculation. As GPU costs are the most significant expense, our cost analysis focuses on the latency and GPU memory consumption of the dynamic chunking workflow that powers ingestion on the Yurts platform. (There are also incremental costs associated with storage and data I/O.)
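To make the structure of such a GPU-based estimate concrete, here is a minimal sketch. It assumes that ingestion latency can be summarized as an aggregate throughput in tokens per second and that memory consumption determines how many GPUs are rented; the function name, parameters, and the numbers in the usage lines are hypothetical placeholders, not Yurts measurements or cloud list prices.

```python
def gpu_ingestion_cost(total_tokens, tokens_per_second, hourly_rate_usd, num_gpus=1):
    """Estimate the cloud GPU cost of ingesting a corpus.

    total_tokens:      number of tokens to ingest
    tokens_per_second: aggregate ingestion throughput across all GPUs (from latency measurements)
    hourly_rate_usd:   rental price of one GPU per hour
    num_gpus:          GPUs needed to satisfy the workflow's memory footprint
    """
    hours = total_tokens / tokens_per_second / 3600
    return hours * hourly_rate_usd * num_gpus

# Hypothetical inputs, chosen only to make the arithmetic easy to follow:
# 3.6M tokens at 1,000 tokens/s is exactly one hour on one GPU.
cost = gpu_ingestion_cost(total_tokens=3_600_000,
                          tokens_per_second=1_000,
                          hourly_rate_usd=2.0)
print(f"${cost:.2f}")
```

The sketch leaves out the incremental storage and data I/O costs, consistent with treating GPU time as the dominant expense.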
We get the following costs for enterprises of various scales from our analysis (for a detailed step-by-step calculation of the two methods, refer to the appendix):
What we see from the above is that for a medium-sized enterprise (100-1k employees), Yurts’ ingestion and dynamic chunking methodology would cost ~$360, while an approach akin to Contextual Retrieval, which requires long-context-window models and external embedding models, would cost ~$109k. That is roughly 300x more! When we scale these numbers to different sizes of enterprises, we quickly find that the Contextual Retrieval methodology becomes prohibitively expensive.
Ingestion, though crucial at the beginning, is not necessarily a one-time cost. Enterprise documents are updated regularly at different cadences, which may require them to be re-ingested. For instance, operation logs and Jira/Linear tickets are updated daily, product release notes weekly, and FAQs for product releases quarterly, while HR documents, employee training manuals, and product manuals are typically updated yearly. Each re-ingestion adds further to the cost.
In this post we compare Yurts’ dynamic chunking and embedding strategy to Anthropic’s new Contextual Retrieval technique. Some key takeaways from this analysis are:
Statistical significance of the performance scores (between Yurts RAG and the Contextual Retrieval)
Evaluating the statistical significance via t-test.
The two-tailed P value equals 0.7544. By conventional criteria, this difference is considered to be not statistically significant.
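For readers who want to sanity-check this kind of result, the sketch below runs a comparable test using only the Python standard library. It swaps in a two-proportion z-test (a normal approximation) in place of the t-test used above, and it assumes both systems were scored on the same 248 queries; the exact inputs to the original calculation are not stated here, so an exact match with 0.7544 is not expected.

```python
import math

def two_proportion_p_value(p1, p2, n1, n2):
    """Two-tailed p-value for the difference between two proportions (pooled z-test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal variable, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Scores from the post; n = 248 queries per system is an assumption.
p_value = two_proportion_p_value(0.932, 0.939, 248, 248)
print(f"two-tailed p = {p_value:.3f}")
```

Under these assumptions the p-value comes out at roughly 0.75, far above the conventional 0.05 threshold and consistent with the conclusion that the difference is not statistically significant.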
For this analysis we consider a collection of 10k documents each with about 300 pages (this corresponds to an average of 192k tokens in each document). This is a good representation of a typical knowledge base in a mid-size enterprise. A similar analysis can be conducted for smaller and larger enterprises. We now calculate the various costs associated with model deployment, API calls, context generation, and running embedding models for the two techniques.
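As a quick check of the corpus size implied by these assumptions, the arithmetic can be written out as follows. The constant names are ours; the values are taken directly from the paragraph above.

```python
NUM_DOCUMENTS = 10_000          # documents in the knowledge base
PAGES_PER_DOCUMENT = 300        # approximate pages per document
TOKENS_PER_DOCUMENT = 192_000   # average tokens per document

tokens_per_page = TOKENS_PER_DOCUMENT / PAGES_PER_DOCUMENT
total_tokens = NUM_DOCUMENTS * TOKENS_PER_DOCUMENT

print(f"tokens per page: {tokens_per_page:.0f}")   # 640
print(f"total tokens:    {total_tokens:,}")        # 1,920,000,000
```

The knowledge base therefore holds about 1.92 billion tokens, at roughly 640 tokens per page, and every ingestion cost that follows scales with that total.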
We start with the Contextual Retrieval technique. It requires two models for the ingestion process:
The primary cost for this technique corresponds to the number of API calls made to both models, which we calculate in the table below.
The following table lists the important quantities that go into our cost analysis, together with how each quantity is calculated and what the final numbers look like:
[1] https://www.anthropic.com/news/contextual-retrieval
[2] https://docs.voyageai.com/docs/pricing
We now move on to the Yurts RAG system. Although it can be hosted on-premise, for a fair comparison we assume that the GPUs are acquired on an hourly basis from a cloud provider and use the associated costs in our calculation. The cost analysis involves measuring the latency and memory consumption of our technique to determine the GPU requirement, which forms the primary cost in this analysis. The resulting analysis for our system is as follows:
[3] https://aws.amazon.com/ec2/instance-types/g5/
When we scale these numbers to different sizes of enterprises, we quickly find that the Contextual Retrieval methodology becomes very expensive.