Large Language Models (LLMs) are powerful tools for generating natural language responses, but they sometimes hallucinate or give incorrect answers. This limitation becomes a challenge when we need factual, grounded information—especially in real-world applications. That’s where Retrieval-Augmented Generation (RAG) steps in.
RAG combines traditional information retrieval with language generation. Instead of relying purely on the LLM’s internal knowledge, RAG systems search external documents and ground their responses in that retrieved context. This boosts accuracy, reduces hallucinations, and makes it possible to “chat with your own data.”
To build a RAG pipeline, we: (1) load a generation model and an embedding model, (2) embed our documents and store them in a vector database, (3) define a prompt template, and (4) chain the retriever and the model together.
Let’s walk through a simple RAG pipeline that answers a question about book sales using a local model setup.
1. Load Models

from langchain.llms import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Load quantized local generation model
llm = LlamaCpp(
    model_path="your_model_path.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    max_tokens=512,    # cap on generated tokens
    n_ctx=2048,        # context window size
    seed=42            # reproducible sampling
)

# Load embedding model for text similarity
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")

2. Embed and Store Documents

from langchain.vectorstores import FAISS
texts = [
    "The novel sold over 1.2 million copies globally in its first year.",
    "Book sales in Europe accounted for nearly 40% of the total.",
    "The author signed a deal for a sequel in early 2024."
]
# Embed and store documents
db = FAISS.from_texts(texts, embedding_model)
3. Prompt Template
from langchain.prompts import PromptTemplate
template = """<|user|>
Relevant details:
{context}
Answer the following question using the above information:
{question}<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

4. Build the RAG Chain

from langchain.chains import RetrievalQA
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved chunks into a single prompt
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)
response = rag.invoke("How many copies did the book sell?")
print(response)

Query Rewriting

Reformat verbose or unclear questions to better guide retrieval.
Example
Original: “I like dolphins. Where do they live?”
Rewritten: “Where do dolphins live?”
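In practice the rewrite is usually done by prompting the LLM itself ("Rewrite the following as a standalone search query: …"). As a rough illustration of the idea only, here is a toy heuristic (the function name and rules are invented for this sketch) that drops the preamble and keeps the question sentence:

```python
import re

def rewrite_query(query: str) -> str:
    """Toy rewrite: keep only the final question sentence.

    A real system would ask the LLM to rewrite (and resolve pronouns,
    e.g. "they" -> "dolphins"); this heuristic just strips preamble.
    """
    sentences = re.split(r"(?<=[.!?])\s+", query.strip())
    questions = [s for s in sentences if s.endswith("?")]
    return questions[-1] if questions else query

print(rewrite_query("I like dolphins. Where do they live?"))
# -> Where do they live?
```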
Query Decomposition

Break down complex queries into multiple focused searches.
Example
Query: “Compare Apple’s revenue in 2020 vs. 2023.”
Split into: “Apple’s revenue in 2020” and “Apple’s revenue in 2023”.
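A minimal sketch of the idea, using a hand-written pattern for this one query shape (in a real system, the LLM itself would be prompted to propose the sub-queries):

```python
import re

def decompose(query: str) -> list[str]:
    """Split a "Compare X in A vs. B" query into two focused searches.

    Hand-written pattern for illustration only; an LLM would handle
    arbitrary phrasings.
    """
    m = re.match(r"Compare (.+) in (\d{4}) vs\.? (\d{4})\.?$", query)
    if not m:
        return [query]
    subject, year1, year2 = m.groups()
    return [f"{subject} in {year1}", f"{subject} in {year2}"]

print(decompose("Compare Apple's revenue in 2020 vs. 2023."))
# -> ["Apple's revenue in 2020", "Apple's revenue in 2023"]
```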
Sequential Search

Perform sequential searches to answer compound questions.
Example
Step 1: “Top car manufacturers 2023”
Step 2: “Does Toyota make EVs?”, etc.
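The pattern above can be sketched as a loop in which each hop's result is made available to later queries. The `sequential_search` helper and the dictionary-backed `search` function below are invented for this sketch; a real system would call a retriever or search API at each hop:

```python
def sequential_search(steps, search):
    """Run searches in order; later queries may reference earlier
    results via {step1}, {step2}, ... placeholders."""
    results = {}
    for i, step in enumerate(steps, start=1):
        query = step.format(**results)   # fill placeholders from prior hops
        results[f"step{i}"] = search(query)
    return results

# Toy stand-in for a real retriever.
facts = {
    "Top car manufacturers 2023": "Toyota",
    "Does Toyota make EVs?": "Yes, e.g. the bZ4X.",
}
out = sequential_search(
    ["Top car manufacturers 2023", "Does {step1} make EVs?"],
    facts.get,
)
print(out["step2"])
# -> Yes, e.g. the bZ4X.
```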
Query Routing

Direct queries to different sources based on topic (e.g., HR data → Notion, customer data → CRM).
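A keyword-based sketch of a router (the source names and keyword sets here are made up; production systems often let the LLM pick the route instead of matching keywords):

```python
def route(query, routes, default="general_index"):
    """Return the data source whose keywords appear in the query."""
    q = query.lower()
    for keywords, source in routes:
        if any(word in q for word in keywords):
            return source
    return default

routes = [
    ({"vacation", "payroll", "benefits"}, "notion_hr"),
    ({"customer", "ticket", "invoice"}, "crm"),
]
print(route("How many vacation days do I get?", routes))
# -> notion_hr
```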
Agentic RAG

Treat the LLM as an agent that not only retrieves from multiple tools but can also act (e.g., post updates to Notion).
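A stripped-down dispatch loop gives the flavor: the agent emits a tool call, and tools can act as well as retrieve. The `"tool_name: argument"` convention and tool names below are invented for this sketch; frameworks such as LangChain agents let the LLM do the tool selection:

```python
def run_tool(command, tools):
    """Dispatch a "tool_name: argument" command to the matching tool."""
    name, _, arg = command.partition(":")
    return tools[name.strip()](arg.strip())

tools = {
    "search": lambda q: f"search results for {q!r}",
    "post_to_notion": lambda msg: f"posted {msg!r} to Notion",
}
print(run_tool("post_to_notion: Q3 sales summary", tools))
# -> posted 'Q3 sales summary' to Notion
```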
Evaluating RAG

Evaluation of RAG models involves both human judgment and automated tools. Key metrics include how faithful the answer is to the retrieved context, how relevant it is to the question, and overall fluency.
Automated tools like Ragas also measure metrics such as faithfulness, answer relevance, context precision, and context recall.
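As a simplified illustration of one such metric, context precision can be thought of as the fraction of retrieved chunks that are actually relevant to the question. Note that Ragas' real implementation is LLM-judged and rank-aware, so treat this as a sketch of the idea only:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks judged relevant (simplified)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved)

retrieved = ["chunk_a", "chunk_b", "chunk_c"]
relevant = {"chunk_a", "chunk_c"}
print(round(context_precision(retrieved, relevant), 2))
# -> 0.67
```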