9 minute read

Yurts RAG: Performance That Doesn’t Break the Bank

Yurts summary
Retrieval-augmented generation (RAG) systems use many different algorithms to chunk, embed, and rank textual content, enabling natural-language question answering over private knowledge bases. In a recent article, Anthropic introduced Contextual Retrieval, a new chunking algorithm that surpasses state-of-the-art (SOTA) methods, and released the “Codebases dataset” for benchmarking it. In this article, we evaluate Yurts’ RAG pipeline on this new benchmarking dataset. The evaluation shows that Yurts’ RAG system matches the performance of Contextual Retrieval while operating at only 1/300th of the cost, underscoring our commitment to providing high-performance solutions that are cost-effective for our customers. Finally, we highlight some key challenges in using new benchmarking datasets to evaluate RAG platforms, offering insights for those navigating this evolving space.

At Yurts, we are in the trade of building helpful AI assistants for enterprises that (a) generate answers grounded in THEIR massive proprietary knowledge base, (b) scale to a large number of knowledge workers in the enterprise, and (c) respect and mirror access controls previously established in the enterprise. We accomplish this with a multi-faceted approach, at the heart of which is a powerful and robust Retrieval-Augmented Generation (RAG) system that supports natural language queries.

RAG is a powerful technique that enables Large Language Models (LLMs) to answer questions about a user’s private knowledge base without having to train these models on private data. In a typical RAG system, a user's knowledge base is broken down (chunked) into smaller text pieces, and each piece is embedded into a vector space using an embedding model. These embeddings capture the semantic information present in the chunks and allow a direct comparison against a natural language query (embedded into the same vector space), so that the most relevant chunks can be retrieved for a given query.
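To make this concrete, here is a minimal sketch of a vanilla chunk-embed-retrieve loop. The embedding model and fixed chunk size are illustrative stand-ins, not the components used in the Yurts platform:

```python
# Minimal sketch of a vanilla RAG retrieval step: fixed-size chunking,
# embedding, and cosine-similarity search. The model and chunk size are
# placeholders for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking by whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    chunks = [c for doc in documents for c in chunk(doc)]
    embeddings = model.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(embeddings)

def retrieve(query: str, chunks: list[str], embeddings: np.ndarray, k: int = 5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity on normalized vectors
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]
```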

One of the major challenges with chunking and embedding is that chunks embedded independently of each other can lack appropriate context. For example, consider a knowledge base that comprises the following text: “The Golden Gate Bridge is a historical landmark in the city of San Francisco. Its red color makes it easily recognizable from a distance.” If our chunking technique ends up separating the two sentences, it becomes impossible to disambiguate the “It” in the second sentence without additional context. Many techniques have been developed to address these shortcomings, such as appending generic document summaries to chunks, hypothetical document embeddings, and summary-based indexing.

Recently, Anthropic introduced a new technique called Contextual Retrieval to better address this issue. In Contextual Retrieval, every chunk is augmented with a concise, chunk-specific context that situates the chunk within the overall document. This is achieved by prompting their Claude model with both the chunk and the whole document. For example, the two contextualized chunks for the previous knowledge base could look like:

Chunk 1: “This chunk introduces the Golden Gate Bridge, a landmark described in a document about San Francisco. The Golden Gate Bridge is a historical landmark in the city of San Francisco.”
Chunk 2: “This chunk describes the color of the Golden Gate Bridge from the same document. Its red color makes it easily recognizable from a distance.”
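For illustration, a contextualization step along these lines can be sketched with the Anthropic SDK as below. The prompt wording and model choice are our own approximations, not the exact prompt or model from Anthropic’s article:

```python
# Sketch of LLM-based chunk contextualization: for each chunk, ask the model
# for a short context given the whole document, then prepend it to the chunk
# before embedding. Prompt text and model id are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>

Write a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer with the context only."""

def contextualize(chunk: str, document: str,
                  model: str = "claude-3-haiku-20240307") -> str:
    response = client.messages.create(
        model=model,
        max_tokens=150,
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    context = response.content[0].text.strip()
    # The contextualized chunk (context + original text) is what gets embedded.
    return f"{context}\n\n{chunk}"
```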

In this post, we make the following contributions: 

  • We evaluate and compare Yurts’ chunking and embedding strategy against Anthropic’s Contextual Retrieval technique on the Codebases dataset released by Anthropic, and show that on the recall@5 metric both systems deliver comparable performance (~93%). 
  • Additionally, we share a detailed cost comparison between the two methods and demonstrate that the Yurts ingestion method costs 1/300th as much as the Contextual Retrieval method. 
  • Finally, we highlight several nuances that one must contend with when using new benchmarking datasets to evaluate a RAG platform.

Yurts’ RAG and Contextual Retrieval: Examining the Approaches

Details on the Codebases dataset: The Codebases dataset is divided into three different components:

  1. Documents to be ingested: There are 90 documents which are essentially files containing code (in languages like Python, Rust, C++) pulled from different repositories.
  2. User queries: There are 248 queries that a user might ask against these documents.
  3. Golden chunks: A set of chunks, called the golden chunks, is provided for each query. These are the parts of the code files that the dataset’s authors deem necessary to answer a particular query.

More statistics concerning the Codebases dataset can be found in the appendix.

Evaluating Yurts: To evaluate Yurts’ RAG platform on the Codebases dataset, we ingest the 90 documents present in the dataset. The platform then uses its custom chunking and embedding technique to store the documents in a manner that enables superior retrieval performance. Following this, we issue the 248 queries from the dataset to the platform and collect, for each query, the top-5 chunks that the platform deems most relevant to answering it.
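Schematically, this amounts to the harness sketched below. `run_benchmark` and `retrieve_top_k` are hypothetical names standing in for the platform’s actual retrieval interface, and the field name is an assumption rather than the Codebases schema:

```python
# Sketch of the benchmark harness: issue each query and keep the top-5 chunks
# the retrieval system returns. `retrieve_top_k` is a hypothetical stand-in
# for the platform's retrieval endpoint.
from typing import Callable

def run_benchmark(queries: list[dict],
                  retrieve_top_k: Callable[[str, int], list[str]],
                  k: int = 5) -> dict[str, list[str]]:
    retrieved = {}
    for item in queries:
        retrieved[item["query"]] = retrieve_top_k(item["query"], k)  # field name assumed
    return retrieved
```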

Unlike existing techniques (including Contextual Retrieval), where chunking is static and done only once before the embedding stage, we use dynamic approaches to ensure that no critical information relevant to a user's query is left out. This makes it difficult to directly compare our top-5 retrievals with the golden chunks present in the Codebases dataset as our chunks might not align with the ones present in the dataset.

To remedy this, we take a two-step approach. First, we compute an edit distance between the top-5 chunks retrieved by our platform and the corresponding golden chunks. Second, we use a human-in-the-loop evaluation. This step ensures that chunks retrieved by our platform that contain the part of the golden chunk relevant to answering the question are counted towards the final score. This is important because we do not want to penalize a retrieval system for retrieving chunks that do not align with the extraneous information that might be contained in the golden chunks.
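A minimal sketch of the first step is shown below; it consumes the per-query retrievals collected above. The specific similarity measure (difflib’s ratio) and the 0.8 match threshold are illustrative assumptions, not the exact criterion used in our evaluation:

```python
# Sketch of step one: count a query as a hit if any of its top-5 retrieved
# chunks is "very close" to one of its golden chunks under an edit-distance
# style similarity. Similarity measure and threshold are illustrative.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()  # 1.0 means identical strings

def recall_at_k(retrieved: dict[str, list[str]],
                golden: dict[str, list[str]],
                threshold: float = 0.8) -> float:
    hits = sum(
        any(similarity(chunk, gold) >= threshold
            for chunk in top_chunks
            for gold in golden[query])
        for query, top_chunks in retrieved.items()
    )
    return hits / len(retrieved)
```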

Our Results: On running our two-step evaluation on the Codebases benchmark, we found the following: 

  1. Edit-distance based evaluation: Comparing our top-5 retrieved chunks with the golden chunks using edit distance, we get a score of 55%, i.e., 55% of the time the golden chunk for a query was very close to one of the top-5 chunks retrieved by our system. While this number might seem discouraging, by itself it does not mean much, since aligning chunks is extremely difficult and even golden chunks contain a lot of spurious information. A more nuanced evaluation is necessary at this point, which we carry out in the next step.
  2. Human-in-the-loop evaluation: For the queries on which our system didn’t do well in the first step, we run a human evaluation. In this evaluation we found three broad categories of issues with the golden chunks and queries:
  • Ground truth “golden chunks” aren’t truly golden: While adding the human-in-the-loop evaluation to identify failure modes, we found that close to 7% of the queries in the dataset have inappropriate golden chunks associated with them. That is, although the Codebases dataset identifies a piece of text within a document as “golden” for a given query, human evaluators unanimously agreed that it is not relevant for answering the query.
  • Many queries are of very poor quality: Additionally, human evaluators identified close to 5% of the queries as very poor, lacking the context needed for anyone to identify pieces of information from the document corpus that would answer them (for instance, queries that do not specify which file to fetch data from).
  • Golden chunks aren’t “completely” golden: Finally, human evaluators also identified that close to 25% of the queries have golden chunks that contain a lot of irrelevant information. While irrelevant information is not necessarily a problem in and of itself, as long as the chunk also contains the relevant information, calling such a chunk golden can be misleading: it penalizes other chunking techniques for not capturing the irrelevant information. To remedy this problem, we specifically compare the relevant pieces of information within the golden chunk to the Yurts RAG retrieved chunk and re-score our performance accordingly.

After aggregating the results from the above evaluations our final results are as follows:

Procedure | Recall@5 score
Edit-distance alignment of RAG contexts with golden chunks | 55%
+ Manually identify all golden chunks that aren’t “golden” for a user query | 62.2%
+ Manually identify all generic questions (i.e., questions for which even humans cannot pull any relevant information) | 67.8%
+ Manually identify all RAG contexts that contain all the relevant information (while being subsets of the golden chunk) | 93.2%
All techniques mentioned, barring Yurts RAG, use Voyage-Large-2 as their embedding model and Claude-2 for chunk contextualization. Approximate costs are evaluated for large enterprises (1,000+ employees) housing ~2 million documents (detailed cost analysis in the section below).
Figure credits: Alison Martin, Yurts

While comparing different combinations of SOTA embedding models and retrieval algorithms on the same benchmark (Codebases), we find that Yurts RAG performs at 93.2% (i.e., 93.2% of the queries from the benchmark have the golden chunk within the top-5 retrieved contexts), while the other methods evaluated range from 80.9% to 93.9%, with the best performance coming from Contextual Retrieval. The performance difference between Contextual Retrieval and the Yurts RAG system is not statistically significant (details of the calculation are in the appendix).

Ingestion cost comparison: Yurts RAG and Contextual Retrieval

In this section we evaluate the cost of the two techniques to the end customer. Such an analysis is crucial, since cost can be a deciding factor when choosing one technique over another in a business-centric application. The analysis compares the costs associated with model deployment, API calls, context generation, and running embedding models for the two techniques.

The primary costs associated with the Contextual Retrieval technique correspond to the number of API calls made to the: 

  • Long context window model (Claude), for generating contextualized chunks; and 
  • Embedding model (Voyage-2), for embedding those chunks.

Document ingestion on the Yurts platform, on the other hand, is powered by very small language models optimized for on-premise deployment. Although the Yurts RAG system can be hosted entirely on-premise, for a fair comparison we assume that the GPUs are rented on an hourly basis from a cloud provider (for instance, AWS) and use those rates in our calculation. Since GPU time is the most significant expense, our cost analysis focuses on the latency and GPU memory consumption of the dynamic chunking workflow that powers ingestion on the Yurts platform. (There are incremental costs associated with storage and data I/O.)

We get the following costs for enterprises of various scales from our analysis (for a detailed step-by-step calculation of the two methods, refer to the Appendix):

What we see from the above is that for a medium-sized enterprise (100-1k employees), Yurts’ ingestion and dynamic chunking methodology costs ~$360, while an approach akin to Contextual Retrieval, which requires long context window models and external embedding models, costs ~$109k. That is almost 300x more! When we scale these numbers to different sizes of enterprises, we quickly find that the Contextual Retrieval methodology becomes prohibitively expensive.

One has to remember that ingestion, though crucial at the beginning, is not a one-time cost. Documents at enterprises are updated regularly at different cadences, which requires them to be re-ingested. For instance, operation logs and Jira/Linear tickets get updated daily, product release notes weekly, FAQs for product releases quarterly, and HR documents, employee training manuals, and product manuals typically yearly, further adding to the cost of ingestion.
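To see how quickly this compounds, the sketch below annualizes re-ingestion cost by update cadence. The corpus split, cadences, and per-document ingestion cost are all placeholder assumptions chosen for illustration:

```python
# Sketch of annualized re-ingestion cost by update cadence. All figures are
# placeholder assumptions, not measurements for any particular enterprise.
cost_per_doc = 0.0007  # assumed per-document ingestion cost, in dollars

corpus_by_cadence = {              # (number of documents, re-ingestions per year)
    "daily (logs, tickets)":       (50_000, 365),
    "weekly (release notes)":      (5_000,   52),
    "quarterly (FAQs)":            (2_000,    4),
    "yearly (manuals, HR docs)":   (10_000,   1),
}

annual_cost = sum(n_docs * times_per_year * cost_per_doc
                  for n_docs, times_per_year in corpus_by_cadence.values())
print(f"Annual re-ingestion cost: ${annual_cost:,.2f}")
```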

Takeaways

In this post we compare Yurts’ dynamic chunking and embedding strategy to Anthropic’s new Contextual Retrieval technique. Some key takeaways from this analysis are:

  1. Yurts’ system performs on par with this new technique on a benchmarking dataset released exclusively to study this new technique.
  2. Many nuances, like aligning chunks, sifting out generic questions, and discounting irrelevant information, need to be kept in mind when working with “golden” chunks.
  3. While techniques like Contextual Retrieval, which make many calls to long-context models like Claude, may improve performance, they do so by increasing the cost prohibitively: as much as $1,100 per ingestion.
  4. Yurts’ novel dynamic chunking and embedding strategy is on par with SOTA while keeping the cost per ingestion as low as $3.50 for a corpus of 5,000 documents.

Appendix

Codebases dataset Statistics


Statistical significance of the performance scores (between Yurts RAG and the Contextual Retrieval)

We evaluate the statistical significance via a two-sample t-test.

  • Number of samples in Yurts group: 248
  • Mean score of Yurts group: 0.932
  • Std deviation of Yurts group: 0.253
  • Number of samples in Contextual Retrieval group: 248
  • Mean score of Contextual Retrieval group: 0.939
  • Std deviation of Contextual Retrieval group: 0.238

The two-tailed P value equals 0.7544. By conventional criteria, this difference is considered to be not statistically significant.
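This p-value can be reproduced directly from the summary statistics above, for example with SciPy:

```python
# Two-sample t-test from summary statistics. Per-query scores are 0/1 hits,
# so each group's mean is its recall@5 and the standard deviations follow.
from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=0.932, std1=0.253, nobs1=248,   # Yurts RAG
    mean2=0.939, std2=0.238, nobs2=248,   # Contextual Retrieval
    equal_var=True,
)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")  # p ≈ 0.75
```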

Details of our cost analysis

For this analysis we consider a collection of 10k documents, each with about 300 pages (this corresponds to an average of 192k tokens per document). This is a good representation of a typical knowledge base in a mid-size enterprise. A similar analysis can be conducted for smaller and larger enterprises to obtain the numbers quoted in the main text. We now calculate the various costs associated with model deployment, API calls, context generation, and running embedding models for the two techniques.

We start with the Contextual Retrieval technique. It requires two models for the ingestion process:

  1. Long context window model (Claude, with prompt caching): for generating the contextualized chunks. 
  2. Embedding model: for embedding chunks along with their contextualized metadata. (For this analysis, we use API access costs for Voyage-2.)

The primary cost for this technique corresponds to the number of API calls made to both models, which we calculate in the table below.

The following table lists the important quantities that go into our cost analysis, how they are calculated, and what the final numbers look like:

[1] https://www.anthropic.com/news/contextual-retrieval
[2] https://docs.voyageai.com/docs/pricing
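As a rough sketch of the kind of calculation behind these numbers, the dominant terms are the long-context calls (which re-read the document for every chunk) and the embedding calls. All per-token prices, chunk sizes, and context lengths below are placeholder assumptions, not the exact figures from our table or the providers’ current price lists:

```python
# Back-of-the-envelope cost model for Contextual Retrieval ingestion.
# Every number below is an assumed placeholder for illustration only.
num_docs           = 10_000    # corpus size used in this appendix
tokens_per_doc     = 192_000   # ~300 pages per document
chunk_tokens       = 800       # assumed chunk size
context_out_tokens = 75        # assumed length of each generated chunk context

chunks_per_doc = tokens_per_doc // chunk_tokens

# Assumed per-million-token prices (placeholders, not actual price lists).
llm_input_per_mtok  = 0.25
llm_output_per_mtok = 1.25
embed_per_mtok      = 0.12

# Each contextualization call reads the full document (cheaper with prompt
# caching) plus the chunk, and writes a short context; the embedding model
# then embeds chunk + context.
input_tokens  = num_docs * chunks_per_doc * (tokens_per_doc + chunk_tokens)
output_tokens = num_docs * chunks_per_doc * context_out_tokens
embed_tokens  = num_docs * chunks_per_doc * (chunk_tokens + context_out_tokens)

cost = (input_tokens / 1e6 * llm_input_per_mtok
        + output_tokens / 1e6 * llm_output_per_mtok
        + embed_tokens / 1e6 * embed_per_mtok)
print(f"Estimated ingestion cost: ${cost:,.0f}")
```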

Moving on to the Yurts RAG system, which can be hosted on-premise: for a fair comparison, we again assume that the GPUs are rented on an hourly basis from a cloud provider and use those rates in our calculation. Here the analysis comes down to the latency and memory consumption of our technique, which together determine the GPU requirement that forms the primary cost. The resulting analysis for our system is as follows:

[3] https://aws.amazon.com/ec2/instance-types/g5/
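The corresponding calculation for the Yurts side reduces to GPU-hours. The per-document latency, concurrency, and hourly instance rate below are placeholder assumptions, not measurements from the Yurts platform:

```python
# Back-of-the-envelope GPU cost for ingestion with small, locally hosted models.
# All figures are assumed placeholders for illustration only.
num_docs        = 10_000  # same corpus as above
seconds_per_doc = 85.0    # assumed end-to-end chunking + embedding latency per document
gpu_hourly_rate = 1.50    # assumed on-demand rate for a single-GPU g5-class instance

gpu_hours = num_docs * seconds_per_doc / 3600
cost = gpu_hours * gpu_hourly_rate
print(f"GPU hours: {gpu_hours:.0f}, estimated ingestion cost: ${cost:,.2f}")
```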

When we scale these numbers to different sizes of enterprises, we quickly find that the Contextual Retrieval methodology becomes very expensive. 


written by
Guruprasad Raghavan
Lead Research Scientist and Founder