Retrieval-Augmented Generation

RAG extends a language model’s knowledge by retrieving relevant documents at inference time and including them in the context window.


The Basic Pipeline

Index documents into a vector store (embed each chunk, store embedding + text)
At query time, embed the user’s question and retrieve the top-k most similar chunks
Pass the retrieved chunks + query to the model as context

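# Simplified RAG pipeline.
# `embed`, `vector_store`, and `llm` are placeholders for your embedding
# function, vector database client, and LLM client.
def rag_query(query: str, vector_store, llm, embed, k: int = 5) -> str:
    # Step 1: embed the query and retrieve the k most similar chunks
    query_embedding = embed(query)
    chunks = vector_store.search(query_embedding, top_k=k)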

    # Step 2: build context
    context = "\n\n".join(chunk.text for chunk in chunks)

    # Step 3: generate
    prompt = f"""Answer the question using only the context below.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {query}
Answer:"""
    return llm.complete(prompt)
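
The indexing step is not shown in the pipeline code, so here is a minimal sketch of what a vector store does internally. Everything in it is an assumption made for illustration: `InMemoryVectorStore`, `Chunk`, and `toy_embed` are hypothetical names, the letter-count embedding stands in for a real embedding model, and the brute-force search stands in for a real approximate-nearest-neighbor index.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy store: brute-force cosine search over every stored chunk."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add(self, text: str, embedding: list[float]) -> None:
        # Store the embedding together with the text it came from.
        self.chunks.append(Chunk(text, embedding))

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[Chunk]:
        ranked = sorted(
            self.chunks,
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


store = InMemoryVectorStore()
for doc in ["cats purr and sleep", "rockets burn fuel", "dogs bark loudly"]:
    store.add(doc, toy_embed(doc))

top = store.search(toy_embed("rocket fuel"), top_k=1)
print(top[0].text)
```

The store exposes `search(query_embedding, top_k=...)` and returns objects with a `.text` attribute, which is the shape the pipeline function above relies on.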

Chunking Strategy

Chunk size matters. Too small: chunks lose context. Too large: irrelevant content dilutes the relevant. 256–512 tokens with overlap (50–100 tokens) is a common starting point.

Strategy	Chunk size	Overlap	Good for
Fixed token	256–512	50–100	General purpose
Sentence	~1–3 sentences	1 sentence	High-precision retrieval
Paragraph	~200–400 tokens	None	Narrative/prose documents
Recursive	Variable	Variable	Mixed-structure documents

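Overlap reduces the chance that a relevant sentence is cut in two at a chunk boundary: text near a boundary appears in both neighboring chunks, so at least one chunk is likely to contain it whole.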
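
A fixed-token chunker with overlap can be sketched as a sliding window. This is a minimal illustration, not a production splitter: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for the tokenizer of whatever embedding model you use.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 256, overlap: int = 50
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=2):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['c', 'd', 'e', 'f']
# ['e', 'f', 'g', 'h']
# ['g', 'h', 'i', 'j']
```

Each window repeats the last two tokens of the previous one, so a short phrase that straddles one boundary still appears whole in a neighboring chunk.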

Embedding Models

The quality of retrieval depends heavily on the embedding model.

OpenAI text-embedding-3-small/large: strong hosted baseline
BAAI/bge-m3 (loadable through sentence-transformers): strong open-source option, multilingual
Cohere embed-v3: high-quality hosted option; Cohere also offers a separate reranking model

Use a model that was trained on data similar to your domain. A general-purpose embedding model may perform poorly on legal, medical, or code-heavy corpora.
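
One way to check whether a model suits your domain is to measure retrieval quality on a small set of labeled queries before committing to it. The sketch below computes a hit rate at k: the fraction of queries for which at least one known-relevant chunk appears in the top k results. The function name, the chunk ids, and the two result sets are made up for illustration and do not come from any real model.

```python
def hit_rate_at_k(
    retrieved: list[list[str]], relevant: list[set[str]], k: int
) -> float:
    """Fraction of queries with at least one relevant chunk id in the top k."""
    if len(retrieved) != len(relevant):
        raise ValueError("need one result list and one gold set per query")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ids, gold in zip(retrieved, relevant) if gold & set(ids[:k])
    )
    return hits / len(retrieved)


# Hypothetical data: gold chunk ids per query, and the ranked ids that two
# candidate embedding models returned for the same three queries.
gold = [{"c1"}, {"c7"}, {"c3", "c4"}]
model_a = [["c1", "c2"], ["c5", "c7"], ["c9", "c8"]]
model_b = [["c2", "c1"], ["c5", "c6"], ["c9", "c8"]]

print(hit_rate_at_k(model_a, gold, k=2))
print(hit_rate_at_k(model_b, gold, k=2))
```

Even a few dozen labeled queries drawn from real usage can show whether a general-purpose model is adequate for a specialized corpus.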

Improving Retrieval Quality

Hybrid search

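Combine dense (vector) search with sparse (BM25/keyword) search. Dense search handles semantic similarity; sparse search handles exact keyword matches such as product codes or rare names. The two result lists are merged, for example with reciprocal rank fusion, and a reranker can then score the merged results.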
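
The text does not prescribe how the dense and sparse result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the required method. Each document scores the sum of 1 / (k + rank) over every list in which it appears; the constant k = 60 is a conventional default, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in several lists accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical ranked ids from a vector search and a BM25 search.
dense = ["d3", "d1", "d2"]
sparse = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

RRF uses only ranks, so it avoids having to normalize cosine similarities and BM25 scores onto a common scale.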

Reranking

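After retrieving the top-k chunks, a cross-encoder reranker scores each chunk against the query more precisely than embedding similarity allows, because it reads the query and the chunk together rather than comparing two independently computed vectors. This adds latency, since the model runs once per query-chunk pair, but it typically improves precision.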

Initial retrieval: top-20 chunks by vector similarity
Reranker: score each of 20 chunks against query
Final context: top-5 chunks by reranker score
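
The second stage of this flow can be sketched as a function that takes the initial candidates and a scoring function. The scorer here is a deliberate placeholder: `overlap_score` is a hypothetical word-overlap heuristic standing in for a real cross-encoder, which would be passed in through the same `score_fn` parameter.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    final_k: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the best final_k."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:final_k]]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: shared-word fraction."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Assumption: these came back from the initial vector search.
candidates = [
    "sodium is an alkali metal",
    "pumps require regular maintenance",
    "the reactor uses liquid sodium coolant",
]
top = rerank("sodium coolant", candidates, overlap_score, final_k=2)
print(top)
```

Because the scorer runs once per candidate, the size of the initial candidate set (20 in the example above) directly controls the added latency.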

Query transformation

The user’s raw query may not match the language of the documents. Rewrite it before embedding:

Expand abbreviations or jargon
Generate multiple query phrasings and retrieve for all
Use HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, use that embedding to retrieve
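
The last two techniques can be sketched as thin wrappers around an existing retriever. The functions below are a minimal illustration under stated assumptions: `generate`, `embed`, and `search` are hypothetical callables for your LLM, embedding model, and vector search, the prompt wording is made up, and chunks are represented as plain strings.

```python
from typing import Callable

Embedding = list[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Embedding],
    search: Callable[[Embedding, int], list[str]],
    k: int = 5,
) -> list[str]:
    """HyDE: embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)


def multi_query_retrieve(
    queries: list[str],
    embed: Callable[[str], Embedding],
    search: Callable[[Embedding, int], list[str]],
    k: int = 5,
) -> list[str]:
    """Retrieve for every phrasing and merge, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for phrasing in queries:
        for chunk in search(embed(phrasing), k):
            if chunk not in seen:  # drop chunks already returned earlier
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

HyDE relies on the idea that a plausible answer, even an incorrect one, tends to be phrased more like the target documents than a short question is. Multi-query retrieval can return up to k chunks per phrasing, so the merged list usually needs truncating or reranking before it goes into the prompt.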

When RAG Beats Fine-Tuning

Scenario	RAG	Fine-tuning
Knowledge changes frequently	✓ (no retraining)	✗ (requires retraining)
Need citations/provenance	✓ (chunks are attributable)	✗
Task-specific style/format	✗	✓
Domain-specific reasoning	Sometimes	✓
Large, sparse knowledge base	✓	✗ (all knowledge in weights)

RAG and fine-tuning are complementary. Fine-tune for the task shape; use RAG for the knowledge.