
Retrieval-Augmented Generation (RAG): Supercharging LLMs with Search

Large Language Models (LLMs) are powerful tools for generating natural language responses, but they sometimes hallucinate or give incorrect answers. This limitation becomes a challenge when we need factual, grounded information—especially in real-world applications. That’s where Retrieval-Augmented Generation (RAG) steps in.

What is RAG?

RAG combines traditional information retrieval with language generation. Instead of relying purely on the LLM’s internal knowledge, RAG systems search external documents and ground their responses in that retrieved context. This boosts accuracy, reduces hallucinations, and makes it possible to “chat with your own data.”

Real-World Examples

  • Search engines like Bing AI, Perplexity, and Google Gemini now use RAG-style techniques to generate informative summaries based on search results.
  • Enterprise tools use RAG to let users query internal documentation, CRM data, or even specific books.

From Search to RAG

To build a RAG pipeline, we:

  1. Retrieve: Search for top relevant documents using an embedding-based similarity search.
  2. Generate: Feed those documents to the LLM to generate a grounded answer.

Example: Book Sales with RAG (Local Setup)

Let’s walk through a simple RAG pipeline that answers a question about book sales using a local model setup.

1. Setup: Load Models

from langchain.llms import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Load quantized local generation model
llm = LlamaCpp(
    model_path="your_model_path.gguf",
    n_gpu_layers=-1,
    max_tokens=512,
    n_ctx=2048,
    seed=42
)

# Load embedding model for text similarity
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")

2. Create a Vector Store

from langchain.vectorstores import FAISS

texts = [
    "The novel sold over 1.2 million copies globally in its first year.",
    "Book sales in Europe accounted for nearly 40% of the total.",
    "The author signed a deal for a sequel in early 2024."
]

# Embed and store documents
db = FAISS.from_texts(texts, embedding_model)

3. Prompt Template

from langchain.prompts import PromptTemplate

template = """<|user|>
Relevant details:
{context}

Answer the following question using the above information:
{question}<|end|>
<|assistant|>"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

4. RAG Pipeline

from langchain.chains import RetrievalQA

rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)

# invoke returns a dict; the generated answer is under the "result" key
response = rag.invoke("How many copies did the book sell?")
print(response["result"])

Advanced RAG Techniques

1. Query Rewriting

Reformulate verbose or unclear questions into focused queries that better guide retrieval.

Example
Original: “I like dolphins. Where do they live?”
Rewritten: “Where do dolphins live?”
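
In a real pipeline the rewrite is produced by prompting the LLM itself; the sketch below uses a hypothetical `rewrite_query` stand-in that simply keeps the trailing question, just to show where the rewriting step sits before retrieval.

```python
import re

def rewrite_query(message: str) -> str:
    """Toy stand-in for an LLM rewrite: keep only the trailing question.

    A production system would instead send the message to the LLM with an
    instruction like "Rewrite this as a short, self-contained search query."
    """
    match = re.search(r"([A-Z][^.?!]*\?)\s*$", message)
    return match.group(1) if match else message

print(rewrite_query("I like dolphins. Where do they live?"))
# -> Where do they live?
```

The rewritten query, not the raw user message, is what gets embedded and searched.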

2. Multi-query RAG

Break down complex queries into multiple focused searches.

Example
Query: “Compare Apple’s revenue in 2020 vs. 2023.”
Split into:

  • “Apple revenue 2020”
  • “Apple revenue 2023”
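
A minimal sketch of the decomposition step, assuming a rule-based splitter for illustration (real systems usually ask the LLM to propose the sub-queries); each sub-query is then retrieved separately and the results are merged before generation.

```python
import re

def split_comparison(query: str) -> list[str]:
    """Toy decomposition of a two-year comparison into focused sub-queries."""
    years = re.findall(r"\b(?:19|20)\d{2}\b", query)
    subject = re.search(r"Compare (\w+)'s (\w+)", query)
    if not subject or not years:
        return [query]  # nothing to split; search the query as-is
    company, attribute = subject.group(1), subject.group(2)
    return [f"{company} {attribute} {year}" for year in years]

print(split_comparison("Compare Apple's revenue in 2020 vs. 2023."))
# -> ['Apple revenue 2020', 'Apple revenue 2023']
```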

3. Multi-hop RAG

Perform sequential searches to answer compound questions.

Example
Step 1: “Top car manufacturers 2023”
Step 2: “Does Toyota make EVs?”, etc.
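
The defining feature of multi-hop RAG is that the result of one search parameterises the next. The sketch below stubs the retriever with a hypothetical two-entry corpus to make that dependency concrete.

```python
def search(query: str) -> list[str]:
    """Hypothetical retriever stubbed with a tiny in-memory corpus."""
    corpus = {
        "top car manufacturers 2023": ["Toyota led global car sales in 2023."],
        "does toyota make evs?": ["Toyota sells the bZ4X, a battery-electric SUV."],
    }
    return corpus.get(query.lower(), [])

# Hop 1: identify the top manufacturer
hop1 = search("Top car manufacturers 2023")
manufacturer = hop1[0].split()[0]  # "Toyota"

# Hop 2: the hop-1 result fills in the next query
hop2 = search(f"Does {manufacturer} make EVs?")
print(hop2[0])
```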

4. Query Routing

Direct queries to different sources based on topic (e.g., HR data → Notion, Customer data → CRM).
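A router can be as simple as keyword matching, though production systems often let the LLM classify the query instead. The source names below are illustrative.

```python
def route(query: str) -> str:
    """Toy keyword router mapping a query to a data source."""
    topics = {
        "notion_hr_docs": ["vacation", "payroll", "benefits"],
        "crm": ["customer", "account", "deal"],
    }
    q = query.lower()
    for source, keywords in topics.items():
        if any(keyword in q for keyword in keywords):
            return source
    return "web_search"  # fallback when no source matches

print(route("How many vacation days do I get?"))  # -> notion_hr_docs
print(route("Show the customer account history"))  # -> crm
```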

5. Agentic RAG

Treat the LLM like an agent that not only retrieves from multiple tools but can also act (e.g., post updates to Notion).

Evaluating RAG Systems

Evaluation of RAG models involves both human judgment and automated tools. Key metrics include:

  • Fluency: Is the response natural and readable?
  • Utility: Is it useful and informative?
  • Citation Recall: Are all facts backed by sources?
  • Citation Precision: Are all sources relevant?
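
Citation recall and precision reduce to simple ratios once each statement and each citation has been labelled. The toy numbers below are illustrative; in practice the labels come from human annotators or an LLM judge.

```python
# Statements in the generated answer, and which of them a source backs up
answer_facts = ["sold 1.2M copies", "40% of sales in Europe", "won an award"]
supported_facts = {"sold 1.2M copies", "40% of sales in Europe"}

# Sources the answer cites, and which of them are actually relevant
cited_sources = ["doc1", "doc2", "doc3"]
relevant_sources = {"doc1", "doc2"}

recall = sum(f in supported_facts for f in answer_facts) / len(answer_facts)
precision = sum(s in relevant_sources for s in cited_sources) / len(cited_sources)

print(f"citation recall:    {recall:.2f}")    # 2 of 3 facts backed by a source
print(f"citation precision: {precision:.2f}") # 2 of 3 citations relevant
```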

Automated tools like Ragas also measure:

  • Faithfulness: Does the answer stay true to retrieved content?
  • Answer Relevance: Is it actually answering the question?