Solving the RAG vs. Long Context Model Dilemma - The New Stack


Solving the RAG vs. Long Context Model Dilemma


Long context models are great for reducing hallucination for certain use cases that warrant longer context but are not ideal for all situations.

Jan 21st, 2025 6:37am by Kiran Matty

Image from Krot_Studio on Shutterstock.

Many developers have been using retrieval-augmented generation (RAG) with a large-scale context corpus to build GenAI applications and to tame problems such as the hallucinations that affect general-purpose large language models (LLMs).

Now long context models are emerging, such as Gemini with its context window of 2 million tokens, and their potential benefits may make you wonder whether you should ditch RAG altogether. The key to resolving this dilemma is to understand the pros and cons of long context models and make an informed decision about their suitability for your use case.

Benefits and Limitations of RAG vs. Long Context Models

Traditionally, LLMs have had small context windows that limit the amount of text, measured in tokens, that can be processed at once. RAG has been an effective way to work around this limitation. By retrieving the most relevant chunks of text, augmenting the user prompt with them and then passing the augmented prompt to the LLM, RAG works effectively with data sets much larger than the context window would normally support.
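The retrieve, augment and pass flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bag-of-words cosine score stands in for a real embedding model and vector search, and the names `retrieve`, `build_prompt` and `CHUNKS` are hypothetical, not part of any product API.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer (assumption: a stand-in for real embeddings)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks that score highest against the question."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Augment the user question with only the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical corpus; a real one would be far larger than any context window.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five to seven business days.",
]

if __name__ == "__main__":
    print(build_prompt("How long do refunds take?", CHUNKS, k=1))
```

Only the selected chunks reach the model, so the prompt stays small no matter how large the corpus grows.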

However, a long context model such as Gemini can process the provided context directly, without a separate RAG system, which simplifies the application workflow and potentially reduces latency. To put a context window of 1 million tokens into perspective, it is equivalent to eight average-length English novels or the transcripts of more than 200 average-length podcast episodes. Even so, a long context model is no panacea for reducing hallucinations, and it has its share of limitations.

First, long context models suffer from a diminished focus on relevant information, which can degrade answer quality, according to research from NVIDIA.

Second, for use cases such as QA chatbots, what matters is not so much the quantity of information in the context as its quality. Higher-quality context is achieved via highly selective, granular searches specific to the question asked, which is what RAG enables.

Finally, long context models require more GPU resources to process the long context, leading to longer processing times and higher costs per query. You may be able to mitigate this by using the key-value (KV) cache to store the input tokens for reuse across requests, but that has significant GPU memory requirements and hence drives up the associated costs. The key is to achieve high answer quality with fewer input tokens.
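To make the cost argument concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration: the price per million input tokens, the 4,000-token RAG prompt and the model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and do not describe Gemini or any other specific model.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single query at a flat per-token price."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per token,
    per layer, per KV head (hence the leading factor of 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical price: 1 USD per million input tokens.
PRICE_PER_MILLION = 1.00

long_context_cost = input_cost_usd(1_000_000, PRICE_PER_MILLION)  # full window
rag_cost = input_cost_usd(4_000, PRICE_PER_MILLION)               # a few chunks

# Hypothetical model shape, used only to show how the memory scales.
cache_gib = kv_cache_bytes(1_000_000, layers=32, kv_heads=8, head_dim=128) / 2**30

if __name__ == "__main__":
    print(f"long context: ${long_context_cost:.4f} per query")
    print(f"RAG prompt:   ${rag_cost:.4f} per query")
    print(f"KV cache for 1M tokens: {cache_gib:.1f} GiB")
```

Under these assumptions the input cost scales linearly with prompt length, and caching a full 1-million-token context takes on the order of a hundred gibibytes of GPU memory, which is why sending fewer, better-chosen tokens matters.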

Despite their limitations, long context models support a few compelling use cases that require longer context, such as translation and summarization. One example is translating documents from English into Sanskrit (one of the least-spoken languages in India) for educational purposes. LLMs struggle with translation into Sanskrit because of the language's complex grammatical structure and the limited availability of training data compared to more widely spoken languages. Providing a sufficiently large number of example translations as context helps boost the accuracy of the translation. Other use cases include summarizing and comparing multiple large documents at once to generate insights, for example, comparing the 10-K reports of multiple companies to create financial benchmarks.

Long context models are great for reducing hallucination in the use cases that warrant longer context. For all other use cases, we recommend using RAG to retrieve the context relevant to the user's question, which delivers high accuracy cost-effectively. If RAG does not meet the desired accuracy, we suggest using RAG in conjunction with fine-tuning to increase domain specificity.
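As a minimal sketch of the translation use case, the snippet below packs as many example pairs as fit into a token budget before appending the text to translate. The helper names are hypothetical, the 4-characters-per-token estimate is a crude assumption rather than a real tokenizer, and the example translations are left as placeholders instead of invented Sanskrit.

```python
def estimate_tokens(text: str) -> int:
    """Very rough size estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def build_many_shot_prompt(examples: list[tuple[str, str]], source_text: str,
                           token_budget: int) -> str:
    """Add example pairs in order until the next one would exceed the budget."""
    header = "Translate English to Sanskrit. Follow the style of the examples.\n\n"
    footer = f"English: {source_text}\nSanskrit:"
    used = estimate_tokens(header) + estimate_tokens(footer)
    shots = []
    for english, sanskrit in examples:
        shot = f"English: {english}\nSanskrit: {sanskrit}\n\n"
        cost = estimate_tokens(shot)
        if used + cost > token_budget:
            break  # stop once the context window budget is exhausted
        shots.append(shot)
        used += cost
    return header + "".join(shots) + footer


# Placeholder pairs; in practice these would be vetted reference translations.
EXAMPLES = [
    ("Good morning.", "<reference translation 1>"),
    ("Where is the library?", "<reference translation 2>"),
    ("The student reads a book.", "<reference translation 3>"),
]

if __name__ == "__main__":
    print(build_many_shot_prompt(EXAMPLES, "The teacher writes a letter.", 1_000_000))
```

With a long context window the budget can be large enough to hold thousands of such pairs, which is the situation where a long context model has an advantage over retrieving a handful of chunks.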

Couchbase’s Capella AI Services helps developers like you build performant RAG and agentic applications quickly. Feel free to sign up for our private preview to get started with your AI project.