
RAG Systems vs. LCW: Performance and Cost Trade-offs


Large language models (LLMs) with extremely long context windows (LCW) have become a focal point of discussion, particularly for enterprise adoption of GenAI [1,2,3,4]. Long context window LLMs such as Gemini, GPT-4o, Llama-3.1-405B, and Mixtral have further fueled this interest.

With the emergence of LCW models came the development of academic benchmarks to quantify the effectiveness of these large systems. The most popular benchmarks designed for LCW evaluation are:

  1. Needle in a Haystack (and its variants) [5,6]
  2. Infinity Bench [7]
  3. ZeroSCROLLS [8]

In this article, we conduct a comparative analysis between LCW models, utilizing in-context learning, and retrieval-augmented generation (RAG) systems paired with smaller context window models. This comparison spans both ends of the AI architectural size spectrum and is applied to two variants of the Needle in a Haystack (NIAH) benchmark.

Our major contributions in this study are twofold: 

  1. We demonstrate that RAG systems paired with small context window models are more performant and cost-effective than long context window (LCW) models for use cases that involve retrieval and analysis of specific information from a very large corpus of documents. 
  2. Academic benchmarks (like Needle in a Haystack) aren't well suited to demonstrating the unique abilities of LCW models in enterprise AI scenarios. This motivates the development of better benchmarks that can highlight abilities of LCW models that smaller systems cannot match. Additionally, the NIAH assays are out of touch with the reality of enterprise information: most haystacks are not neatly formatted text, but are instead trapped within PDFs, presentations, and other document formats.

What is the Needle in a Haystack benchmark?

        Figure 1. Needle of text inserted within a haystack of information 

The Needle in a Haystack benchmark, as the name suggests, is built around a synthetic document wherein a “needle” (i.e., a relevant piece of information) is hidden within a large “haystack” (i.e., a large corpus of text); models (or systems of models) are then evaluated with a query that aims to retrieve that particular needle.
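To make this concrete, below is a minimal sketch of how a single NIAH sample can be constructed. The helper name, the word-level splitting, and the placeholder corpus file are illustrative assumptions, not the exact procedure used by the benchmarks discussed below.

```python
import random

def build_niah_sample(haystack_text: str, needle: str, depth: float | None = None) -> str:
    """Insert a needle sentence into the haystack at a relative depth (0.0 = start, 1.0 = end)."""
    words = haystack_text.split()
    if depth is None:
        depth = random.random()  # arbitrary insertion position
    cut = int(len(words) * depth)
    return " ".join(words[:cut] + needle.split() + words[cut:])

# "essays.txt" is a placeholder for whatever corpus forms the haystack.
sample = build_niah_sample(
    haystack_text=open("essays.txt").read(),
    needle="Figs are one of the secret ingredients needed to build the perfect pizza.",
    depth=0.5,
)
```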

The NIAH benchmark is highly relevant for enterprise AI use cases that involve accessing and analyzing specific pieces of information from a large corpus of text and documents.

We evaluate the two systems (RAG and LCW models) on two variants of the needle-in-a-haystack benchmark, namely: 

  1. Biographies (NIAH) benchmark: Here, a single needle (that is very closely related in content to the haystack) is inserted into the haystack at an arbitrary position. Using a test set of 140 biographical questions for every haystack size, we compare and contrast different systems, using haystacks ranging from 2k to 2M tokens. The accuracy score reported for every haystack size corresponds to the number of biographical questions that the model (or system of models) answered perfectly at that haystack size (a sketch of this accuracy computation follows this list). (We provide this dataset on Hugging Face for easy access: Biographies-HF)
  2. Multi-needle in a haystack: Here, three needles are inserted into the haystack at different positions within the large corpus of text, and the model (or system of models) is evaluated on its ability to retrieve all the needles from the haystack given a relevant query. In this particular dataset, the needles are significantly different from the haystack. Specifically, the needles correspond to sentences that contain information about the best ingredients for making pizza, while the haystack is compiled by stacking essays written by Paul Graham. To compare and contrast different systems, we compute the percentage of needles (x/3 * 100) retrieved from the haystack while varying the size of the haystack. (We also provide this dataset on Hugging Face for easy access: Multi-needle HF)
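A hedged sketch of how the per-haystack-size accuracy on the biographies benchmark can be computed; the `answer_question` and `is_perfect_match` callables and the `.text`/`.gold` attributes are placeholders for the system under test and the exact-match check, not the exact evaluation harness.

```python
def biographies_accuracy(questions, haystack: str, answer_question, is_perfect_match) -> float:
    """Fraction of the 140 biographical questions answered perfectly for one haystack size."""
    correct = sum(
        1
        for q in questions
        if is_perfect_match(answer_question(haystack, q.text), q.gold)
    )
    return correct / len(questions)
```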

Results

To understand the effectiveness of these long context window LLMs and RAG systems, we evaluated their performance on the biographies (NIAH) benchmark.

Performance: 

Biographies (NIAH) benchmark

The biographies benchmark expects the model (or system of models) to retrieve the needle (i.e., the natural-language text about the particular individual being asked about) from the haystack and reason over it in order to report the following pieces of information, when available (a minimal sketch of the expected output format follows the list). 

  1. Date of birth: in the format of day, month, and year
  2. Date of death: in the same format as above
  3. Nationality: as a string. If not provided in the description, it should be reported as “unknown.”
  4. Politician: as a boolean. To infer if the individual is a politician based on their description.
  5. Sportsperson: as a boolean. To infer if the individual is a sportsperson based on their description.
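For reference, here is a minimal sketch of the structured answer the benchmark expects along with an exact-match check; the field names and types follow the list above, while the dataclass layout and string formats are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class BiographyAnswer:
    date_of_birth: str   # e.g. "5 April 1908"
    date_of_death: str   # same format, or "unknown"
    nationality: str     # "unknown" if not stated in the biography
    politician: bool     # inferred from the description
    sportsperson: bool   # inferred from the description

def is_perfect_match(pred: BiographyAnswer, gold: BiographyAnswer) -> bool:
    """A question counts as correct only if every field matches the gold answer."""
    return pred == gold
```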

The RAG systems evaluated are:

  1. Yurts RAG (end-to-end) system coupled with open-source Llama-3-8B-instruct, which has a maximum context window of 8k tokens (blue curve).
  2. Contextual AI’s RAG 2.0 (red curve); values obtained from their blog post [9].

On the other hand, the LCW model evaluated is GPT-4 with a maximum context window of 32k tokens (yellow curve). Because the context window is capped at 32k tokens, we are unable to perform in-context learning on haystacks larger than 32k tokens.

Figure 2. Performance of RAG vs Long context window models on the Biographies Benchmark. 

As demonstrated in Figure 2, we find that RAG systems coupled with smaller context window models (blue and red curves) are significantly more performant than the LCW model (yellow curve). Additionally, it is noteworthy that the performance of the RAG systems stays nearly constant as the size of the haystack is varied from 2k to 2M tokens, while the LCW model shows a sharp drop in accuracy as the haystack size increases (and cannot be evaluated at all beyond its 32k-token window). The minor difference between the two RAG systems on the benchmark could be a product of forced JSON creation, which is a well-known source of hallucination and degradation [10].

Although RAG systems are substantially more performant than LCW models on this benchmark, we identified a couple of reasons why the RAG systems don’t achieve 100% accuracy on the task across different haystack sizes:

  1. Model fails on reasoning: Although the retrieval side of the RAG stack successfully discovered the needle from the haystack and brought it into the context window for the model to infer whether the person was a sportsperson, politician, etc., the model produced a faulty inference. For instance, even though the biographical information pertained to an individual who was a serial killer, the model incorrectly inferred that the person was a sportsperson.
  2. Errors in the ground truth: We also discovered that the ground truth (gold response) in the benchmark was occasionally incorrect, forcing the score below 100%. For instance, when this biography needle: "alejandra somers ( born 5 april 1908 , date of death unknown ) was an english professional footballer. He was born in catshill" was inserted into the haystack, the ground truth listed the nationality as “unknown”, while inferring “English” as the nationality isn’t incorrect given the description above. 

Multi-needle in a haystack

In addition to testing the model (or system of models) on retrieving a single needle (similar in genre to the haystack), we also subject the systems to discovering and analyzing multiple hidden needles in a haystack. In this dataset, the hidden needles are distributed across the haystack and are very different in genre from it; the goal is to evaluate whether the system can retrieve distributed hidden needles from haystacks with a lot of distracting text. 

The haystack is composed of a collection of Paul Graham’s essays, while the 3 needles inserted in the haystack are: 

  1. Figs are one of the secret ingredients needed to build the perfect pizza.
  2. Prosciutto is one of the secret ingredients needed to build the perfect pizza.
  3. Goat cheese is one of the secret ingredients needed to build the perfect pizza.

The retrieval question posed to the different systems, to retrieve all the needles from the haystack, is: “What are the secret ingredients needed to build the perfect pizza?”
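A minimal sketch of how the multi-needle score can be computed from a system's answer, using the percentage metric described earlier; the simple keyword check is our simplification of however the original evaluation matched needles.

```python
NEEDLE_KEYWORDS = ["figs", "prosciutto", "goat cheese"]

def multi_needle_score(answer: str) -> float:
    """Return the percentage of inserted needles recovered in the answer (x/3 * 100)."""
    found = sum(1 for keyword in NEEDLE_KEYWORDS if keyword in answer.lower())
    return found / len(NEEDLE_KEYWORDS) * 100

print(multi_needle_score("The secret ingredients are figs and prosciutto."))  # ~66.7
```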

For this multi-needle retrieval analysis, we compare two systems:

  1. Yurts RAG (end-to-end) coupled with Llama-3-8B-instruct
  2. GPT-4o with a 128k-token context window

Figure 3. Performance of Yurts RAG (red) vs GPT-4o (blue) on the Multi-needle in a haystack benchmark

As shown in Figure 3, we find that the performance of in-context learning using GPT-4o falls from a perfect 100% as the document (haystack) size increases. We also note that some haystack sizes are more vulnerable than others: for instance, GPT-4o discovers only 1 of the 3 inserted needles (~33%) when they are placed across different depths in a 65k-token haystack, yet retrieves all needles placed across different depths in a 100k-token haystack. Also, the accuracy of retrieving needles is zero for haystacks larger than 128k tokens because GPT-4o has a maximum context window of 128k tokens, preventing in-context learning on larger documents (or haystacks). 

On the other hand, we observe that the Yurts RAG (end-to-end) system is able to retrieve all the needles across the different document (haystack) sizes. We get a perfect retrieval because the retrieval question is very straightforward and does not require any complex reasoning from the coupled LLM (here, Llama-3-8b-instruct). 

To further examine the relationship between needle-retrieval accuracy and the depth of needle insertion into the haystack, we insert the needles at multiple depths and pose the retrieval question to the system being evaluated. 

We evaluated the mean accuracy (over 3 runs) for GPT-4o on the same task, with needles inserted at different depths within the haystack. We find (Figure 4) that GPT-4o is typically successful in retrieving the relevant needles from the haystack when they are placed at the start or end of the document, but struggles when they are placed in the middle. On the other hand, the Yurts RAG system (Figure 5) is able to retrieve the relevant needles irrespective of their position within the haystack - a necessity when working with large volumes of documents within enterprises. 
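A hedged sketch of the depth-resolution sweep: the needles are inserted at a grid of relative depths, the retrieval question is posed, and scores are averaged over 3 runs. The `ask_system` callable stands in for whichever system (GPT-4o or the RAG pipeline) is being evaluated, and the sketch reuses the helpers from the earlier snippets.

```python
import statistics

DEPTHS = [0.0, 0.25, 0.5, 0.75, 1.0]  # relative insertion depth within the haystack
RUNS = 3
QUESTION = "What are the secret ingredients needed to build the perfect pizza?"

def depth_sweep(haystack: str, needles: list[str], ask_system) -> dict[float, float]:
    """Mean multi-needle score at each insertion depth, averaged over RUNS repetitions."""
    results = {}
    for depth in DEPTHS:
        scores = []
        for _ in range(RUNS):
            doc = haystack
            for needle in needles:
                doc = build_niah_sample(doc, needle, depth=depth)  # from the earlier sketch
            scores.append(multi_needle_score(ask_system(doc, QUESTION)))
        results[depth] = statistics.mean(scores)
    return results
```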

Figure 4. GPT-4o performance on Multi-needle in a haystack with needle depth resolution
Figure 5. Yurts RAG (end-to-end) performance on Multi-needle in a haystack with needle depth resolution

GPU compute requirements / compute costs

Having demonstrated that RAG systems coupled with smaller context window models achieve higher accuracy than in-context learning with very long context window models, we now highlight the compute requirements and costs associated with each system (Table 1). 

Yurts RAG (end-to-end) combined with Llama-3-8B-instruct requires up to 2 A10 GPUs for single-user operation and can scale to support 50 concurrent users with 4 A10 GPUs. While the RAG approach involves additional retrieval components, such as Elasticsearch, vector databases, and smaller neural networks, we emphasize GPU requirements as they continue to be the primary cost driver. 

On the other hand, long context window models can require a minimum of 40 A10 GPUs for inference (for a single user). As the number of parameters and architecture of GPT-4o are not public knowledge, we instead detail the GPU requirements for an open-source model that has been tuned to endow it with a very long context window: Gradient AI’s Llama-3-8B-Instruct-Gradient-1048k. In pilot experiments with this model, we find that a naive Hugging Face implementation for inference (without any optimizations) requires close to 80GB of GPU memory (i.e., a minimum of 3 A10 GPUs) for a 16k-token context window. We estimate that utilizing the entire 1M-token context window would require close to 1000GB of GPU memory (roughly 40 A10 GPUs) for a single user. We believe that a vLLM implementation for inference could enable more concurrency on the same 40 A10 GPU footprint. 
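For intuition on why memory grows with context length, below is a rough back-of-the-envelope estimate of just the KV-cache footprint for a Llama-3-8B-class model at fp16. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are the published Llama-3-8B values; the total memory used by a naive Hugging Face implementation is considerably higher once weights, activations, and attention buffers are included, which is consistent with the measurements above.

```python
def kv_cache_gb(context_tokens: int,
                layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache size (GB) = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # ~128 KiB per token
    return per_token_bytes * context_tokens / 1e9

print(kv_cache_gb(16_000))     # ~2 GB of KV cache alone at a 16k-token context
print(kv_cache_gb(1_000_000))  # ~131 GB of KV cache alone at a 1M-token context
```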

Table 1. Yurts RAG vs LCW model compute cost analysis

*A10 is a commercial GPU with a maximum GPU memory of 24GB. We use the cost of renting GPUs on-demand via the AWS cloud ($1.624/hr) for our cost evaluation.  
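Using that on-demand rate, a quick arithmetic sketch of the monthly GPU cost gap; the GPU counts come from the analysis above, while the 24x7 utilization assumption is ours.

```python
A10_HOURLY_USD = 1.624          # AWS on-demand rate used in the cost evaluation
HOURS_PER_MONTH = 24 * 30

def monthly_gpu_cost(num_gpus: int) -> float:
    return num_gpus * A10_HOURLY_USD * HOURS_PER_MONTH

print(monthly_gpu_cost(4))   # ~$4,677/month  (Yurts RAG, up to 50 concurrent users)
print(monthly_gpu_cost(40))  # ~$46,771/month (LCW inference, single user)
```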

RAG: A necessary filter before the LLM

As some of the longer context window models are available via API and are charged on a per-token basis, we would like to demonstrate that applying a RAG filter to the entire document corpus before invoking long context window models is both efficient and cost-effective. For instance, applying RAG on the Biographies benchmark (described above) consistently distilled the large document (haystack) down to very short pieces of text to feed to the LLM, ultimately ensuring that the model can perform in-context learning accurately and cost-effectively on the shorter context. 
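A minimal sketch of what such a RAG filter looks like in front of an LLM call. The `embed` callable is a placeholder for whatever embedding model is used, and this is not the Yurts pipeline itself; it simply illustrates keeping only the top-k most relevant chunks as the LLM context.

```python
import numpy as np

def rag_filter(question: str, chunks: list[str], embed, top_k: int = 3) -> str:
    """Keep only the top-k chunks most similar to the question; this becomes the LLM context."""
    q = embed(question)                        # 1-D embedding vector
    C = np.stack([embed(c) for c in chunks])   # (num_chunks, dim)
    scores = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)  # cosine similarity
    best = np.argsort(-scores)[:top_k]
    return "\n\n".join(chunks[i] for i in sorted(best))

# The filtered context (typically a few hundred tokens) replaces the multi-million-token
# haystack before being passed, together with the question, to the downstream LLM.
```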

Figure 6. Effective context window lengths after Yurts RAG filtering of haystacks. Error bars are Standard Errors

As shown in Figure 6, we find that despite the increasing haystack size (from 2k to 2M tokens), the effective context window (i.e., the RAG-filtered context fed to the smaller context window LLM) is typically in the range of 250 to 350 tokens. As recent work [11] has demonstrated that open-source models perform best on reasoning and analysis tasks when provided with shorter contexts (or prompts), we believe that RAG as an efficient filter coupled with an LLM makes for a highly performant system. 

Conclusions and future work:

Conclusions: 

We have demonstrated that RAG systems are substantially more performant than LCW models on popular academic benchmarks (like the Needle in a Haystack variants) developed to test the effectiveness of long context window models. 

Moreover, we have shown that RAG systems can easily scale to large document corpora, as exemplified with documents containing 2 million tokens, without any degradation in performance or accuracy.

In addition to the performance improvements, we note that RAG systems require far fewer compute resources than long context window models - making the former a perfect candidate for enterprise adoption of Generative AI.

Future work: 

Although we’ve shown that long context window models are both less performant and less cost-effective than their RAG counterparts (specifically the Yurts RAG system), we encourage the community to build better academic benchmarks that truly highlight the unique abilities of LCW models, abilities that cannot be matched by smaller systems - for instance, tasks and benchmarks that genuinely necessitate very long contexts for the completion of user workflows.

References

  1. https://research.ibm.com/blog/larger-context-window
  2. https://gradient.ai/blog/the-haystack-matters-for-niah-evals
  3. https://www.ai21.com/blog/7-long-context-trends-in-the-enterprise
  4. https://www.reddit.com/r/singularity/comments/1b05zpe/i_cant_emphasize_enough_how_mindblowing_extremely/
  5. https://github.com/anyscale/long-context-fine-tuning-blogpost/
  6. https://github.com/gkamradt/LLMTest_NeedleInAHaystack
  7. https://arxiv.org/pdf/2402.13718
  8. https://arxiv.org/abs/2305.14196
  9. https://contextual.ai/introducing-rag2/
  10. https://arxiv.org/pdf/2408.02442v1
  11. https://arxiv.org/abs/2404.06910

Written by Guruprasad Raghavan, Lead Research Scientist and Founder