Generative AI promises to revolutionize the way we work. However, we still have a long way to go from “revolutionary” to “practical.” At Yurts, we focus on the practical applications of generative AI in large organizations. Previously, we’ve written about the non-flashy but necessary features of a real enterprise-ready AI platform in our article “The Promise and Peril of Generative AI for Enterprises.” Today, we discuss the accuracy, costs, and benefits of various generative AI pipeline architectures, focusing on a primary use case: knowledge retrieval and discovery.
Big organizations generate a ton of information. Some of it is structured into neat little rows and columns that can be queried with powerful computational tools. But much of it is messy, disparate, unorganized, and spread across different systems. Think contracts, meeting summaries, field reports, HR benefit statements, and accounting ledger notes. How often have you needed a question answered and not cared which system or which document the answer lives in? You just want an answer.
Broadly, in a generative AI information retrieval task, there are two ways of approaching the problem. First, you can give the entirety of a document (or set of documents) to a large language model (LLM), and prompt the model to answer a question. This requires the model to have a long context window, or more plainly, the model has to be able to accept all of the data in the document (or documents). The model inhales the data from the document, and exhales a response to the prompt. Long context window models have become all the rage, leading to an arms race for larger and larger context windows. Anthropic’s Claude allows you to input up to 200,000 tokens, or about 150,000 words. It can ingest A Tale of Two Cities, a remarkable achievement. (Although not offered, Anthropic has stated the models themselves can accept significantly larger inputs.)
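To make that concrete, here is a minimal sketch of the long context window pattern. It isn’t tied to any particular vendor’s SDK; `call_llm` is a hypothetical stand-in for whichever chat-completion client you use.

```python
# Minimal sketch of the long-context-window approach: hand the model the
# entire document and the question in one prompt. `call_llm` is a
# hypothetical stand-in for a model provider's completion API.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client of choice."""
    raise NotImplementedError("wire this up to a real chat-completion client")

def answer_with_long_context(document: str, question: str) -> str:
    # The entire document rides along with every question, so it must fit
    # in the model's context window and you pay for every token of it.
    prompt = (
        "Answer the question using only the document below.\n\n"
        f"--- DOCUMENT ---\n{document}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```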
The other approach is to split the task into two steps: first, you find the relevant portions of content within the document(s); then you feed only that information (not the whole document) and the user’s prompt to the LLM. This process is called Retrieval Augmented Generation (RAG). In this approach, because you’ve hunted down the relevant information first, the context window doesn’t need to be large enough for a Dickens novel. The Cat in the Hat will usually do.
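Here is the same question answered the RAG way, as a minimal sketch of the two-step pattern rather than a description of the Yurts pipeline. The naive word-count chunker and keyword scorer are illustrative placeholders, and `call_llm` is the same hypothetical stub as in the sketch above.

```python
# Minimal sketch of Retrieval Augmented Generation: retrieve first, generate
# second. This illustrates the two-step pattern, not the Yurts pipeline.

def chunk(document: str, size: int = 300) -> list[str]:
    """Split a document into fixed-size word chunks (a deliberately naive splitter)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def keyword_score(passage: str, question: str) -> int:
    """Toy relevance score: how many question words appear in the passage."""
    question_words = set(question.lower().split())
    return sum(1 for word in passage.lower().split() if word in question_words)

def answer_with_rag(documents: list[str], question: str, top_k: int = 3) -> str:
    passages = [p for doc in documents for p in chunk(doc)]
    best = sorted(passages, key=lambda p: keyword_score(p, question), reverse=True)[:top_k]
    # Only a few short passages reach the model, so a modest context window
    # (and far fewer paid input tokens) is enough.
    prompt = (
        "Answer the question using only the passages below.\n\n"
        + "\n\n".join(f"--- PASSAGE ---\n{p}" for p in best)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # same hypothetical client stub as above
```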
Note: It’s conceivable to train an LLM directly on your company’s data. However, the data security risks and attribution challenges of this approach make it an impractical solution.
We’ve compared the accuracy of these two approaches using a standard open-source, third-party dataset called “Needle in a haystack - Biographies benchmark.” The task is to find and retrieve bits of information (“needles”) buried in larger documents. We ran our RAG pipeline to find these needles and compared the results against GPT-4, a long context window model with a 32K-token capacity, which was given the entire document.
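For readers who want a feel for the task, the sketch below shows the general shape of a needle-in-a-haystack evaluation: plant a known fact in a long filler document and check whether the pipeline’s answer recovers it. It is an illustration of the idea, not the actual Biographies benchmark harness.

```python
# Rough sketch of a "needle in a haystack" style evaluation: bury a known fact
# at a random spot in a long filler document, ask a question whose answer is
# that fact, and check whether the fact appears in the response.
import random

def build_haystack(filler_sentences: list[str], needle: str) -> str:
    sentences = list(filler_sentences)
    sentences.insert(random.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def needle_accuracy(pipeline, cases: list[tuple[list[str], str, str, str]]) -> float:
    """cases holds (filler_sentences, needle, question, expected_answer) tuples;
    `pipeline` is any callable shaped like answer_with_rag(documents, question)."""
    hits = 0
    for filler, needle, question, expected in cases:
        haystack = build_haystack(filler, needle)
        answer = pipeline([haystack], question)
        hits += int(expected.lower() in answer.lower())
    return hits / len(cases)
```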
In the plot below you can see the accuracy of the Yurts RAG pipeline vs. GPT-4. We’ve also included results produced by Contextual.ai and their RAG pipeline. As documents grow larger, the long context window model performs worse, while the RAG solutions maintain higher accuracy across the board.
It is important not to get distracted by the small difference between Yurts and Contextual in this task. The discrepancy here falls well within the margin of error introduced when one moves from this somewhat academic task to real-world enterprise datasets. The important point is that for the most common workplace application, knowledge retrieval, RAG solutions provide a more accurate result than long context window approaches.
Interestingly, the linchpin of the RAG approach isn’t some new technology or technique, but rather something that companies and researchers have had four decades to perfect: search. It is simply more accurate to rely on traditional search methods to identify the most relevant information in a large corpus than to ask an LLM to do it.
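That is why the retrieval step can lean on battle-tested lexical methods. The sketch below swaps the toy keyword scorer from the earlier example for a classic TF-IDF ranker built with scikit-learn; BM25 is another common choice. It is a generic illustration, not a description of the Yurts retriever.

```python
# Rank chunks with classic lexical search (TF-IDF here) before any LLM is
# involved. A generic, assumed retriever -- not the Yurts retrieval stack.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_chunks(chunks: list[str], question: str, top_k: int = 3) -> list[str]:
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(chunks + [question])
    chunk_vecs, question_vec = matrix[:-1], matrix[-1]
    scores = cosine_similarity(chunk_vecs, question_vec).ravel()
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```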
This is really no different from a very human phenomenon that I’m sure you’re all familiar with. If you read A Tale of Two Cities, and are asked to answer a question from that book, you are less likely to get it right than if I give you a short brief on the relevant subject, and then ask the same question. It seems artificial and human intelligence have that in common.
We’ve run a similar analysis with a different dataset, Multi-needle in a haystack, which requires pulling multiple bits of information scattered throughout a document, and we see similar results. This dataset highlights how effective RAG can be at these kinds of problems, and our approach aces the test, getting every task right. The long context window approach again struggles.
In fact, the long context window approach commits a far greater sin than struggling with accuracy: it doesn’t scale. Take Claude, for example. Using the hosted version of Claude on AWS costs $0.015 per 1,000 input tokens plus $0.075 per 1,000 output tokens. What does that mean in human terms? Ask it “What are the two cities in A Tale of Two Cities?” and that single question will cost you more than $3, because the model has to read the entire novel to answer it. With our RAG approach, it’s more like $0.03. For further insights into how Yurts delivers cost-effective RAG solutions without sacrificing performance, you can read more in our detailed analysis on Yurts RAG: Performance That Doesn’t Break the Bank.
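Here is the back-of-the-envelope math behind that figure. The per-token prices are the hosted-Claude rates quoted above; the token counts are rough assumptions (the full novel runs on the order of 200,000 tokens, while a RAG prompt carrying a few retrieved passages is a couple thousand). The exact numbers matter less than the driver: retrieval shrinks the paid-for input by roughly two orders of magnitude.

```python
# Back-of-the-envelope cost of a single question. Prices are the hosted-Claude
# rates quoted above; token counts are rough assumptions, not measurements.
INPUT_PRICE = 0.015 / 1000   # $ per input token
OUTPUT_PRICE = 0.075 / 1000  # $ per output token

def query_cost(input_tokens: int, output_tokens: int = 200) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"whole novel in the prompt: ${query_cost(200_000):.2f}")  # roughly $3
print(f"a few retrieved passages:  ${query_cost(2_000):.2f}")    # a few cents
```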
Three bucks isn’t much, but as we begin to identify and scale the workflows that AI excels at, how many prompts per day do you think a typical employee would want to run? 5? 10? For a 1,000-person company, with each person running 5 a day: 5 x 1,000 x $3 = $15,000/day.
Now, that’s not totally realistic, because at that scale you would want to host it yourself, right? Well, the memory requirements for these long context window models can be enormous. To reach the memory required, you’re going to need a minimum of 40 GPUs (A10s), driving a daily GPU cost of ~$1,560 for a single instance. (And at that volume, you’ll need a couple more.) Yurts, meanwhile, can run on 2 A10 GPUs.
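For the curious, here is the implied arithmetic. The hourly rate below is back-derived from the ~$1,560/day, 40-GPU figure above and should be treated as a working assumption rather than a quoted cloud price.

```python
# Back-of-the-envelope for the self-hosting comparison. The hourly rate is
# back-derived from the ~$1,560/day, 40-GPU figure above -- an assumption,
# not a quoted price.
GPU_HOURLY_RATE = 1560 / (40 * 24)  # roughly $1.63 per A10 GPU-hour

def daily_gpu_cost(num_gpus: int) -> float:
    return num_gpus * 24 * GPU_HOURLY_RATE

print(f"40 x A10 (long context window model): ${daily_gpu_cost(40):,.0f}/day")
print(f" 2 x A10 (Yurts RAG):                 ${daily_gpu_cost(2):,.0f}/day")
```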
Of course, this analysis has revolved around a particular use case, knowledge retrieval and discovery, the application we find to be the most AI shovel-ready. It’s probably a huge pain point in your organization, and an opportunity to introduce significant efficiencies right now. Long context window LLMs certainly have applications for which an approach like RAG simply won’t work. But you shouldn’t let the wonder of record-breaking achievements in AI distract you from what you’re trying to accomplish in your company.