NLP - Advanced - 15 min

Learn RAG — Retrieval Augmented Generation

Last updated: 2026-05-13.

An LLM is frozen at training time: it cannot know about events after its training cutoff, about your private documents, or about facts that change. RAG (Retrieval Augmented Generation) addresses this. When a question comes in, the system first searches a document store for relevant passages, then inserts those passages into the prompt, then lets the LLM read them and answer. The model's reasoning ability is combined with up-to-date, domain-specific knowledge, without retraining anything.

The Standard RAG Pipeline

Two phases — Indexing (offline) and Query (online):

INDEXING (run whenever documents are added or changed):
  1. Split each document into chunks (typically 200-1000 tokens with overlap)
  2. Embed each chunk using an embedding model → 384-3072 dim vector
  3. Store (chunk_text, embedding, metadata) in a vector database
     (Pinecone, Weaviate, Qdrant, Chroma, FAISS, pgvector...)

QUERY (run on every user question):
  1. Embed the user query using the same embedding model
  2. Vector DB returns the top-k chunks with smallest cosine distance to the query (k = 3-10 typical)
  3. Construct prompt:
        [System: 'Answer using ONLY the context below.']
        [Context: chunk_1 chunk_2 chunk_3 ...]
        [User: question]
  4. Send to LLM → answer
  5. Optionally: include source citations in the response

Two-step retrieval + generation · keeps the LLM grounded in real documents
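
The two phases above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and all function names and sample chunks are hypothetical. The final LLM call is left out, so the sketch stops at the constructed prompt.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity; the highest similarity is the smallest cosine distance."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """INDEXING: embed every chunk and store (chunk_text, embedding)."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 3) -> list[str]:
    """QUERY steps 1-2: embed the query with the same model, return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """QUERY step 3: system instruction, numbered context, then the user question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "System: Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"User: {question}"
    )

# Hypothetical sample corpus, already split into chunks.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes five to eight business days.",
]
index = build_index(chunks)
question = "How long is the refund window?"
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Numbering the context chunks in the prompt is one simple way to support step 5: the model can be asked to cite the bracketed numbers, which map back to stored chunks and their metadata.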

Embedding Models — Quietly Important

  • OpenAI text-embedding-3 (small/large): 1536/3072-dim, strong general-purpose, paid API.
  • Cohere embed-v3: multilingual, strong on English/code, paid API.
  • BGE / E5 (open-source): rivals paid models, easy to self-host.
  • Sentence-BERT family: classical workhorses, smaller (384-768 dim).
  • Choose based on: (1) language coverage, (2) document length tolerance, (3) cost, (4) self-host vs API, (5) MTEB benchmark scores for your domain.
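
Embedding dimension feeds directly into the cost criterion. As a rough illustration, the arithmetic below estimates raw vector storage for the dimensions mentioned above. It assumes 4-byte float32 values and a hypothetical corpus of one million chunks, and it ignores index overhead, metadata, and any compression a real vector database may apply.

```python
def index_size_bytes(num_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the embedding vectors alone (float32 by default)."""
    return num_chunks * dim * bytes_per_value

# Hypothetical corpus size; dimensions taken from the models listed above.
for label, dim in [("384-dim", 384), ("1536-dim", 1536), ("3072-dim", 3072)]:
    gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{label}: {gb:.2f} GB for 1M chunks")
```

Under these assumptions, moving from 384 to 3072 dimensions multiplies raw vector storage by eight, which is worth weighing against any retrieval-quality gain measured on your own data.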

Chunking — the Quiet Trap

Documents are usually too long to embed whole, so they must be split into chunks.
The choice of chunk size and boundaries strongly affects retrieval quality.

Fixed-size chunking (simplest):
  • 500 tokens with 50-token overlap is a common baseline
  • Risk: splits sentences mid-thought, misses cross-paragraph context
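
A minimal sketch of fixed-size chunking with overlap follows. It assumes the document is already tokenized into a list; whitespace-separated words stand in for real model tokens here, and the function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Usage: a 1200-token document with the 500/50 baseline.
tokens = [str(i) for i in range(1200)]
chunks = chunk_fixed(tokens)
print([len(c) for c in chunks])
```

The window boundaries fall wherever the token count dictates, which is exactly why this approach can cut a sentence in half.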

Semantic chunking:
  • Split at section/paragraph boundaries
  • Keep semantically coherent units together
  • More complex but better retrieval quality
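
One simple way to respect boundaries is to split on blank lines and pack whole paragraphs into chunks up to a size budget. The sketch below assumes paragraphs are separated by blank lines and measures size in whitespace-separated words rather than model tokens; `chunk_by_paragraph` and `max_words` are hypothetical names. More elaborate semantic chunkers exist, and this is only the boundary-aware baseline.

```python
import re

def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks without ever splitting a paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = (
    "Intro paragraph about refunds.\n\n"
    "Refunds take 30 days.\n\n"
    "Shipping is separate and takes a week."
)
parts = chunk_by_paragraph(doc, max_words=8)
print(parts)
```

A paragraph longer than the budget is kept whole in this sketch; a real system would need a fallback, such as fixed-size splitting, for that case.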

Hierarchical chunking:
  • Index both small chunks (precise retrieval) and large chunks (more context)
  • At query time, retrieve small chunks but feed the LLM the parent large chunks
  • Best of both worlds; implemented by, for example, LangChain's ParentDocumentRetriever
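
The small-to-parent idea can be sketched without any library. In the illustration below, each child chunk stores the id of its parent, retrieval scores the children, and the prompt receives the de-duplicated parents. Word overlap is used as a toy stand-in for vector similarity, and every name here is hypothetical; this is not how any particular library implements it.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set, used by the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_children(parents: list[str], child_words: int = 5) -> list[tuple[str, int]]:
    """Split each parent into small child chunks that remember their parent id."""
    children = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_words):
            children.append((" ".join(words[start:start + child_words]), parent_id))
    return children

def retrieve_parents(query: str, parents: list[str],
                     children: list[tuple[str, int]], k: int = 2) -> list[str]:
    """Rank small chunks, then return the distinct parents of the top-k hits."""
    q = tokens(query)
    ranked = sorted(children, key=lambda c: len(q & tokens(c[0])), reverse=True)
    result, seen = [], set()
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result

parents = [
    "Refund policy. Items can be returned within 30 days. "
    "Refunds are issued to the original payment method.",
    "Shipping policy. Orders ship within two business days. "
    "International delivery takes longer.",
]
children = build_children(parents)
context = retrieve_parents("refunds issued payment method", parents, children)
print(context)
```

Both top-scoring children belong to the refund section, so the LLM would receive that parent once, with its full surrounding context.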

Rule of thumb: chunks should be self-contained enough that the LLM can answer using JUST that chunk's text.

Smaller chunks = more precise retrieval but less context · tune to your domain

Why RAG Beats Fine-Tuning for Knowledge

  • Fresh data: add a new document to the DB and it's available immediately. Fine-tuning would require a re-train.
  • Citations: you can show users WHERE the answer came from. Fine-tuning blends information with no provenance.
  • Compliance: easy to delete data (just remove from DB). Fine-tuned weights cannot be 'unlearned' easily.
  • Cheaper: fine-tuning a 7B model can cost hundreds of dollars per run, depending on method and hardware. Adding embeddings to a vector DB typically costs cents.
  • Composable: same LLM serves many domains via different RAG indexes. Fine-tuning ties a model to one domain.
  • When to fine-tune anyway: when you need to teach the model NEW BEHAVIOURS (tone, format, reasoning style). RAG can't change how the model thinks, only what facts it has access to.

Common RAG Failure Modes

  • Retrieval misses the right chunk → wrong answer. Fix: better chunking, hybrid search (vector + BM25), reranking, query rewriting.
  • Top-k chunks are off-topic but model still tries to answer → hallucination. Fix: explicit 'if not in context, say I don't know' instruction.
  • Conflicting information across chunks → confused answer. Fix: rerank by recency or authority; add metadata filters.
  • Chunks lack context (e.g., 'he said yes' — who?). Fix: store context-aware chunks via hierarchical indexing.
  • Long-tail queries return generic chunks → low quality. Fix: query expansion / hypothetical document embeddings (HyDE).
  • User asks something unanswerable from the corpus → hallucinated answer. Fix: confidence thresholds, refusal templates.
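
Two of the fixes above are easy to illustrate. The sketch below shows reciprocal rank fusion (RRF), one common way to merge a vector ranking with a BM25 ranking for hybrid search, and a similarity threshold that triggers a refusal when no chunk is relevant enough. The source does not prescribe either method; the constant `k = 60` is the conventional RRF default, while the `0.3` threshold, the chunk ids, and the function names are assumptions that would need tuning on real data.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: scores[c], reverse=True)

REFUSAL = "I don't know based on the available documents."

def filter_by_confidence(scored_chunks: list[tuple[str, float]],
                         min_similarity: float = 0.3) -> list[str]:
    """Keep only chunks whose similarity clears the (hypothetical) threshold."""
    return [chunk for chunk, score in scored_chunks if score >= min_similarity]

# Hybrid search: hypothetical chunk ids ranked by two different retrievers.
vector_ranking = ["c2", "c1", "c3"]
bm25_ranking = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)

# Refusal: if nothing clears the threshold, skip the LLM call entirely.
kept = filter_by_confidence([("off-topic chunk", 0.12), ("another chunk", 0.08)])
answer = REFUSAL if not kept else "(send kept chunks to the LLM)"
print(answer)
```

RRF uses only rank positions, so it avoids having to put cosine similarities and BM25 scores on a common scale.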

Practice questions

  1. What problem does RAG primarily solve that fine-tuning struggles with?
  2. What is the role of the embedding model in a RAG system?
  3. Why does chunking strategy matter so much in RAG?
  4. Compared to fine-tuning, what is RAG's main weakness?
