COMPLETE RAG (Retrieval-Augmented Generation) ROADMAP
From Zero to Production: Build Your Own Model & Services
Introduction
What is RAG?
RAG (Retrieval-Augmented Generation) is an AI architecture that enhances Large Language Models (LLMs) by connecting them to external knowledge bases at inference time. Instead of relying solely on parametric memory (what's baked into model weights), RAG retrieves relevant documents and feeds them as context, producing grounded, accurate, up-to-date answers.
Table of Contents
- Foundation & Prerequisites
- Core Concepts & Theory
- Structured Learning Path
- Algorithms, Techniques & Tools
- Design & Development Process
- Working Principles, Architectures & Hardware
- Advanced RAG Patterns
- Building Your Own RAG Service
- Cutting-Edge Developments
- Project Ideas: Beginner to Advanced
- Reverse Engineering Existing Systems
- Production Deployment & MLOps
- Research Papers & Resources
1. FOUNDATION & PREREQUISITES
1.1 Mathematics Foundations
- Linear Algebra
- Vectors, matrices, tensors
- Dot products and cosine similarity (critical for retrieval; see the NumPy sketch at the end of this subsection)
- Matrix multiplication (used in attention mechanisms)
- Eigenvalues, SVD (Singular Value Decomposition β used in LSA)
- Vector spaces and subspaces
- Probability & Statistics
- Probability distributions (Gaussian, Bernoulli, Categorical)
- Bayes' Theorem (foundational for probabilistic retrieval)
- Entropy, KL Divergence, Cross-Entropy (loss functions)
- Maximum Likelihood Estimation
- Expectation-Maximization (EM algorithm)
- Calculus
- Derivatives and gradients
- Chain rule (backpropagation)
- Gradient descent and its variants
- Partial derivatives and Jacobians
- Information Theory
- Shannon entropy
- Mutual information
- TF-IDF derivation (term frequency-inverse document frequency)
- Information gain
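A quick way to ground the linear-algebra items above: cosine similarity between two vectors is what most dense retrievers rank by. A minimal NumPy sketch (the toy vectors are made up for illustration):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.7, 0.1])
doc_vec = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 -> highly similar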
1.2 Programming Prerequisites
- Python (Primary language)
- Object-oriented programming
- Async/await, coroutines
- Type hints and dataclasses
- Context managers and decorators
- Generator functions (for streaming)
- Data Structures & Algorithms
- Hash maps, trees, heaps
- k-d trees and ball trees (for ANN search)
- Graph algorithms (for Knowledge Graphs)
- Priority queues
- Software Engineering
- REST API design (FastAPI, Flask)
- Microservices architecture
- Docker and containerization
- Git version control
- Unit testing and integration testing
1.3 Machine Learning Fundamentals
- Supervised vs unsupervised learning
- Neural networks: perceptrons, activation functions
- Backpropagation and optimization
- Overfitting, regularization, dropout
- Tokenization and vocabulary
- Word embeddings (Word2Vec, GloVe, FastText)
- Evaluation metrics: Precision, Recall, F1, NDCG, MRR
1.4 Deep Learning Foundations
- Recurrent Neural Networks (RNNs, LSTMs, GRUs)
- Convolutional Neural Networks (CNNs for text)
- Attention mechanisms (Bahdanau, Luong)
- Encoder-Decoder architectures
- Transfer learning and fine-tuning
- PyTorch or TensorFlow fundamentals
2. CORE CONCEPTS & THEORY
2.1 The Problem RAG Solves
| Problem | RAG Solution |
|---|---|
| LLM hallucination | Ground answers in retrieved facts |
| Knowledge cutoff | Connect to live/updated databases |
| Domain specificity | Index private documents |
| Explainability | Show source documents |
| Token limit | Retrieve only relevant chunks |
2.2 RAG vs. Fine-Tuning vs. In-Context Learning
Fine-Tuning:
Pros: Learns new behaviors and styles
Cons: Expensive, static knowledge, hallucination risk
Use when: Changing model's tone/behavior/reasoning style
RAG:
Pros: Dynamic knowledge, citable, cheap updates
Cons: Retrieval latency, chunk quality dependency
Use when: Need current/domain-specific factual recall
In-Context Learning (Prompt Engineering):
Pros: Zero training cost
Cons: Context window limits, no persistence
Use when: Simple tasks, short documents
Hybrid (RAG + Fine-Tuning):
Pros: Best of both worlds
Use when: Production systems requiring both accuracy and style
2.3 Core RAG Pipeline Components
[User Query]
      ↓
[Query Processing] ← cleaning, expansion, rewriting
      ↓
[Retriever] ←──────── [Vector Store / Index]
      ↓                         ↑
[Re-Ranker]          [Document Ingestion Pipeline]
      ↓
[Context Assembly]
      ↓
[Generator (LLM)]
      ↓
[Post-Processing]
      ↓
[Response + Citations]
3. STRUCTURED LEARNING PATH
PHASE 0 – Orientation (Week 1-2)
- Read: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
- Understand the original Facebook RAG paper
- Run a toy RAG demo with LangChain + OpenAI
- Understand what a vector embedding is visually
PHASE 1 – Text Processing & Embeddings (Week 3-6)
1A. Text Preprocessing
- Tokenization
- Whitespace tokenization
- BPE (Byte-Pair Encoding) – used in GPT
- WordPiece – used in BERT
- SentencePiece – used in T5, LLaMA
- Unigram Language Model tokenization
- Special tokens: [CLS], [SEP], [PAD], [UNK], [MASK]
- Text Cleaning
- HTML/Markdown stripping
- Unicode normalization
- Stopword removal (context-dependent)
- Lemmatization vs stemming
- Named Entity Recognition (NER) for metadata extraction
- Document Chunking Strategies
- Fixed-size chunking (naive, 256/512 tokens)
- Sentence-based chunking (NLTK, SpaCy)
- Paragraph-based chunking
- Semantic chunking (split on embedding similarity drops)
- Recursive character text splitting (LangChain default)
- Document-aware chunking (respect headings, tables)
- Sliding window with overlap (e.g., 512 tokens, 50 token overlap)
- Agentic chunking (use LLM to determine chunk boundaries)
1B. Embeddings
- Sparse Embeddings (Traditional)
- Bag of Words (BoW)
- TF-IDF (Term Frequency–Inverse Document Frequency)
- BM25 (Best Match 25) – still the gold standard for keyword search
- BM25+ and BM25L variants
- SPLADE (Sparse Lexical and Expansion model)
- Dense Embeddings (Neural)
- Word-level: Word2Vec (CBOW, Skip-gram), GloVe, FastText
- Sentence-level: InferSent, Universal Sentence Encoder
- Transformer-based: BERT, RoBERTa, ALBERT
- Bi-encoder architecture (query and doc encoded separately)
- Cross-encoder architecture (query + doc encoded together)
- State-of-the-Art Embedding Models
- `text-embedding-3-large` (OpenAI, 3072-dim)
- `text-embedding-ada-002` (OpenAI, 1536-dim)
- `all-MiniLM-L6-v2` (Sentence-Transformers, fast)
- `BAAI/bge-large-en-v1.5` (BGE family, SOTA open-source)
- `BAAI/bge-m3` (multilingual, multi-granularity)
- `Cohere embed-v3` (float, int8, binary quantization)
- `E5-mistral-7b-instruct` (LLM-based embeddings)
- `NV-Embed-v2` (NVIDIA, highest MTEB scores as of 2024)
- `voyage-3` (Voyage AI)
- `nomic-embed-text-v1.5` (open, 8192 context)
- `mxbai-embed-large-v1` (Mixedbread)
- Embedding Evaluation
- MTEB (Massive Text Embedding Benchmark)
- BEIR (Benchmarking IR)
- Tasks: classification, clustering, retrieval, STS, summarization
- Fine-tuning Embeddings
- Contrastive learning (SimCSE)
- Triplet loss: anchor, positive, negative
- In-batch negatives
- Hard negative mining
- MNRL (Multiple Negatives Ranking Loss)
- Matryoshka Representation Learning (MRL) – embeddings at multiple dimensions
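To make the fine-tuning bullets concrete, here is a hedged Sentence-Transformers sketch using MultipleNegativesRankingLoss with in-batch negatives; the (query, positive passage) pairs and the output path are invented placeholders:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=["What is the refund window?", "Refunds are accepted within 30 days."]),
    InputExample(texts=["How do I reset my password?", "Use the 'Forgot password' link on the login page."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch passages act as negatives

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedder")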
PHASE 2 – Vector Databases & Indexing (Week 7-10)
2A. Similarity Search Fundamentals
- Distance Metrics
- Cosine similarity: `cos(θ) = (A·B) / (||A|| ||B||)`
- Euclidean (L2) distance
- Manhattan (L1) distance
- Dot product (inner product)
- Hamming distance (binary vectors)
- Jaccard similarity (set-based)
- Exact vs Approximate Nearest Neighbor (ANN)
- Exact search: brute force O(n·d) – only for <100k vectors
- ANN: trade tiny accuracy loss for massive speed gains
- Recall@K as evaluation metric
2B. ANN Indexing Algorithms
- Tree-based
- KD-Tree (fails in high dimensions, >20D)
- Ball Tree
- Random Projection Trees (Annoy)
- ANNOY (Spotify) – forest of random trees
- Hash-based
- Locality Sensitive Hashing (LSH)
- MinHash LSH (for Jaccard)
- SimHash (for cosine)
- Multi-probe LSH
- Graph-based (Best for High-Dim Dense Vectors)
- NSW (Navigable Small World graphs)
- HNSW (Hierarchical Navigable Small World) – de facto standard
- Build: insert nodes, connect to nearest neighbors at each layer
- Query: greedy search from top layer down
- Parameters: M (connections per node), efConstruction, efSearch
- DiskANN (Microsoft) – disk-based, billion-scale
- Vamana (the graph algorithm underlying DiskANN)
- NGT (Yahoo Japan)
- Quantization-based
- Product Quantization (PQ)
- Split vector into sub-vectors, quantize each
- 32x compression with ~95% recall
- Scalar Quantization (SQ) – int8, int4 quantization
- Binary Quantization (BQ) – 1-bit, 32x compression
- Optimized Product Quantization (OPQ)
- FAISS (Facebook AI Similarity Search)
- IVF (Inverted File Index) + PQ: `IndexIVFPQ`
- Flat index: `IndexFlatL2`, `IndexFlatIP`
- GPU FAISS for billion-scale
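A minimal FAISS sketch of the IVF + PQ combination mentioned above (`IndexIVFPQ`); the data is random and the parameters (256 lists, 8 sub-vectors, 8 bits) are illustrative, not tuned recommendations:

import faiss
import numpy as np

d, n = 128, 10_000
xb = np.random.rand(n, d).astype('float32')          # database vectors
quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, 256, 8, 8)    # 256 lists, 8 sub-vectors, 8 bits each
index.train(xb)                                      # IVF/PQ indexes need a training pass
index.add(xb)
index.nprobe = 16                                    # how many lists to scan at query time

query = np.random.rand(1, d).astype('float32')
distances, ids = index.search(query, k=5)
print(ids[0])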
2C. Vector Databases (Production-Grade)
| Database | Type | Best For | Hosted |
|---|---|---|---|
| Pinecone | Managed | Production, ease of use | Cloud only |
| Weaviate | Open-source + Cloud | Hybrid search, modules | Both |
| Qdrant | Open-source + Cloud | Performance, filtering | Both |
| Milvus/Zilliz | Open-source + Cloud | Billion-scale | Both |
| Chroma | Open-source | Local dev, prototyping | Self-hosted |
| pgvector | PostgreSQL ext | Existing Postgres users | Self-hosted |
| Redis Vector | Redis extension | Low-latency, caching | Both |
| OpenSearch | Open-source | Full-text + vector hybrid | Self-hosted |
| Elasticsearch | Open-source | Enterprise search | Both |
| LanceDB | Embedded | Serverless, local | Both |
| Vespa | Open-source | Complex ranking, ML | Self-hosted |
- Metadata Filtering
- Pre-filtering (filter then search)
- Post-filtering (search then filter)
- Filtered HNSW (Qdrant's approach)
- Payload indexing
- Composite filtering (AND, OR, NOT, range queries)
- Hybrid Search
- Combining dense (vector) + sparse (keyword) results
- Reciprocal Rank Fusion (RRF): `score = Σ 1/(k + rank_i)` (see the sketch after this list)
- Weighted sum fusion
- Learned sparse models: SPLADE, SPLADEv2, uniCOIL
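The RRF formula above is easy to implement directly; a minimal sketch fusing a dense and a BM25 result list (doc ids are illustrative, and k=60 is the commonly used constant):

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc8"]
bm25 = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([dense, bm25]))  # doc1 and doc3 rise to the top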
PHASE 3 – Retrieval Strategies (Week 11-14)
3A. Basic Retrieval
- Single-stage dense retrieval
- BM25 keyword retrieval
- Hybrid: BM25 + Dense (most common production setup)
3B. Advanced Retrieval Techniques
- Query Transformation
- Query expansion (add synonyms, related terms)
- HyDE (Hypothetical Document Embeddings)
- Generate a hypothetical answer → embed it → retrieve similar docs
- Multi-Query Retrieval (generate 3-5 query variants)
- Step-back prompting (abstract to higher level)
- Query decomposition (break complex query into sub-queries)
- FLARE (Forward-Looking Active Retrieval)
- Retrieval Modes
- Naive RAG: retrieve top-k → concatenate → generate
- Sentence Window Retrieval: embed sentences, return surrounding window
- Auto-merging Retrieval (LlamaIndex): hierarchical chunks
- Parent-Child Retrieval: embed small chunks, return parent
- Recursive Retrieval: retrieve → generate → retrieve again
- Iterative RAG: multi-hop retrieval for complex questions
- Re-ranking (Critical for Precision)
- Cross-encoders: encode query+doc together, expensive but accurate
- `cross-encoder/ms-marco-MiniLM-L-6-v2`
- `BAAI/bge-reranker-large`
- Cohere Rerank API
- `Jina Reranker`
- ColBERT (Late Interaction)
- Encode query and doc separately, token-level interaction
- `MaxSim` operator for scoring
- RAGatouille library for easy use
- LLM-based Reranking
- RankGPT: use LLM to rank passages
- PairwiseRanker
- Listwise ranking with LLMs
- Learning-to-Rank (LTR)
- Pointwise, pairwise, listwise approaches
- LambdaMART, XGBoost LTR
PHASE 4 – Generation & LLMs (Week 15-20)
4A. Understanding LLMs
- Transformer Architecture Deep Dive
- Self-attention: `Attention(Q,K,V) = softmax(QK^T / √d_k)V` (see the NumPy sketch at the end of this subsection)
- Multi-head attention (MHA)
- Grouped Query Attention (GQA) – used in LLaMA 3
- Multi-Query Attention (MQA) – faster inference
- Feed-forward layers (SwiGLU, GeGLU activations)
- Positional encodings: sinusoidal, RoPE, ALiBi
- Layer Normalization (Pre-LN vs Post-LN)
- KV-Cache mechanism
- Decoder-only Models (Generation)
- GPT family (OpenAI): GPT-4o, GPT-4-turbo
- LLaMA family (Meta): LLaMA 2, LLaMA 3, LLaMA 3.1
- Mistral family: Mistral 7B, Mixtral 8x7B (MoE)
- Gemma (Google): Gemma 2B, 7B, 27B
- Phi family (Microsoft): Phi-3, Phi-3.5
- Qwen (Alibaba): Qwen2, Qwen2.5
- Command-R (Cohere) – RAG-optimized
- DeepSeek-V2, V3 – MoE architecture
- Encoder-Decoder Models
- T5, Flan-T5
- BART, mBART
- Original RAG paper used BART as generator
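The attention formula in this subsection can be reproduced in a few lines of NumPy; a toy sketch with random Q, K, V (real models add learned projections, masking, and multiple heads):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted sum of value vectors

Q = np.random.rand(4, 8); K = np.random.rand(4, 8); V = np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 8)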
4B. Inference Optimization
- Quantization
- GPTQ (Post-Training Quantization, weight-only)
- AWQ (Activation-aware Weight Quantization)
- GGUF (llama.cpp format) – Q4_K_M, Q5_K_M, Q8_0
- bitsandbytes (8-bit, 4-bit via NF4)
- HQQ (Half-Quadratic Quantization)
- FP8 training and inference (H100 native)
- Serving Frameworks
- vLLM (PagedAttention, continuous batching, highest throughput)
- llama.cpp (CPU inference, GGUF format)
- Ollama (local LLM server, easy setup)
- TGI – Text Generation Inference (HuggingFace)
- TensorRT-LLM (NVIDIA, fastest on A100/H100)
- LMDeploy (InternLM)
- SGLang (structured generation, fast)
- Context Length & Memory
- FlashAttention (memory-efficient attention, 2-4x speedup)
- FlashAttention-2, FlashAttention-3
- Sliding window attention (Mistral)
- Ring attention (distributed long context)
- Paged KV-Cache (vLLM)
- GQA/MQA for KV-cache reduction
- Speculative Decoding (draft model speeds up large model)
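As a concrete example of the serving frameworks above, here is a hedged sketch of vLLM's offline inference API; the model name is just an example, and production deployments usually run `vllm serve` behind an HTTP endpoint instead:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=256)

prompt = "Context: ...\n\nQuestion: What is RAG?\n\nAnswer:"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)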
4C. Prompt Engineering for RAG
- System prompt design for RAG
- Context window budget allocation
- Citation/grounding instruction prompting
- Chain-of-thought (CoT) for multi-hop
- Few-shot RAG examples
- Structured output prompting (JSON mode)
- Handling "I don't know" responses
- Confidence calibration prompting
4D. Fine-tuning for RAG
- SFT (Supervised Fine-Tuning)
- Format: `[System] [Retrieved Context] [Query] → [Answer]`
- Datasets: NarrativeQA, QuALITY, SQuAD, HotpotQA
- Tools: HuggingFace TRL, Axolotl, LLaMA-Factory
- Parameter-Efficient Fine-Tuning (PEFT)
- LoRA (Low-Rank Adaptation): `W = W₀ + BA` (B, A are low-rank)
- QLoRA (4-bit quantized LoRA)
- AdaLoRA (adaptive rank)
- LoftQ (quantization-aware LoRA init)
- IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
- RLHF for RAG
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization) – simpler, no reward model
- RLAIF (RL from AI Feedback)
- Constitutional AI (Anthropic)
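A hedged PEFT sketch of the LoRA setup described above; the base model name and `target_modules` are illustrative and vary by architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; W = W₀ + BA applied here
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights are trainable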
PHASE 5 – Evaluation & Observability (Week 21-24)
5A. RAG Evaluation Metrics
- Retrieval Quality
- Context Precision: relevant docs / retrieved docs
- Context Recall: retrieved relevant docs / all relevant docs
- MRR (Mean Reciprocal Rank): `MRR = (1/|Q|) Σ 1/rank_i`
- NDCG@K (Normalized Discounted Cumulative Gain)
- Hit Rate@K
- Generation Quality
- Faithfulness: is the answer grounded in context?
- Answer Relevance: does the answer address the question?
- Answer Correctness: factual accuracy vs ground truth
- BLEU, ROUGE (reference-based, less useful for open-ended)
- BERTScore (semantic similarity)
- G-Eval (LLM-as-judge)
- Ragas Score (RAG-specific composite metric)
- End-to-End Metrics
- Answer Similarity
- Semantic Answer Similarity (SAS)
- ARES (Automated RAG Evaluation System)
- RAGAS (open-source RAG evaluation framework)
- TruLens (evaluation + tracking)
- DeepEval (unit testing for LLMs)
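Two of the retrieval metrics above (Hit Rate@K and MRR) are simple enough to implement by hand; a minimal sketch with toy data:

def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# MRR = mean of reciprocal ranks over all queries (toy example: both hits at rank 2)
queries = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d9"], {"d9"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(mrr)  # (1/2 + 1/2) / 2 = 0.5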
5B. Observability & Monitoring
- Tracing Tools
- LangSmith (LangChain native)
- Phoenix (Arize AI)
- Langfuse (open-source)
- W&B Weave (Weights & Biases)
- Helicone
- OpenTelemetry for custom tracing
- Key Metrics to Monitor
- Latency (P50, P90, P99) per pipeline stage
- Token usage and cost
- Retrieval success rate
- Hallucination rate (using LLM judge)
- User feedback signals (thumbs up/down)
- Cache hit rate
4. ALGORITHMS, TECHNIQUES & TOOLS
4.1 Complete Algorithm Reference
Retrieval Algorithms
Sparse:
├── BM25 (Robertson & Zaragoza, 2009) – most used baseline
├── BM25+ / BM25L – improved variants
├── TF-IDF
├── SPLADE (Formal et al., 2021) – learned sparse
├── uniCOIL – sparse with BERT
└── DeepImpact – learned doc-side weights
Dense:
├── DPR (Dense Passage Retrieval) – Facebook, 2020
├── ANCE (Approximate Nearest Neighbor Negative Contrastive)
├── E5 (EmbEddings from bidirEctional Encoder rEpresentations)
├── BGE (BAAI General Embeddings)
├── GTE (General Text Embeddings, Alibaba)
└── SimCSE (Simple Contrastive Sentence Embeddings)
Hybrid:
├── RRF (Reciprocal Rank Fusion)
├── Linear interpolation: score = α*dense + (1-α)*sparse
├── PLAID (ColBERT-based efficient retrieval)
└── Learned hybrid weights
Multi-hop:
├── MDR (Multi-hop Dense Retrieval)
├── Baleen (condensed retrieval)
├── FLARE (Forward-Looking Active Retrieval Augmentation)
└── IRCoT (Interleaving Retrieval with Chain-of-Thought)
Reranking Algorithms
Cross-Encoders:
├── MonoBERT (point-wise)
├── MonoT5 (seq2seq reranker)
├── DuoBERT (pair-wise)
└── RankLLaMA
Late Interaction:
├── ColBERT (Khattab & Zaharia, 2020)
├── ColBERTv2 (residual compression)
├── PLAID (efficient ColBERT)
└── ColBERT-QA
LLM-based:
├── RankGPT (Sun et al., 2023)
├── PRP (Pairwise Ranking Prompting)
├── LRL (Listwise Reranker)
└── Setwise ranking
Generation Algorithms
Decoding Strategies (see the sampling sketch after these lists):
├── Greedy decoding
├── Beam search (width B)
├── Top-K sampling
├── Top-P (nucleus) sampling
├── Temperature scaling
├── Repetition penalty
├── Contrastive decoding
└── Speculative decoding
RAG-specific:
├── Token-level RAG (RETRO-style)
├── Fusion-in-Decoder (FiD)
├── REALM (Retrieval-Augmented Language Model)
├── kNN-LM (k-nearest neighbors LM)
└── Adaptive Retrieval (decide when to retrieve)
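To make the decoding strategies above concrete, here is a toy NumPy sketch of temperature, top-k, and top-p (nucleus) sampling applied to a single logits vector; real decoders apply this per generation step over the full vocabulary:

import numpy as np

def sample(logits: np.ndarray, top_k: int = 0, top_p: float = 1.0, temperature: float = 1.0) -> int:
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # tokens sorted by probability, descending
    if top_k > 0:
        order = order[:top_k]                 # keep only the k most likely tokens
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    order = order[:cutoff]                    # smallest set whose probability mass >= top_p
    kept = probs[order] / probs[order].sum()  # renormalize over the kept tokens
    return int(np.random.choice(order, p=kept))

print(sample(np.array([2.0, 1.0, 0.5, -1.0]), top_k=3, top_p=0.9, temperature=0.8))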
4.2 Complete Tools Ecosystem
Data Processing
Parsing & Loading:
├── LlamaParse (advanced PDF parsing, tables, figures)
├── Unstructured.io (20+ file types)
├── PyPDF2, pdfplumber, pdfminer
├── Docling (IBM, multi-format)
├── Marker (PDF → Markdown, open-source)
├── Camelot, Tabula (table extraction)
├── Beautiful Soup, Scrapy (web scraping)
├── Pandoc (document conversion)
└── Apache Tika (enterprise parsing)
Chunking:
├── LangChain TextSplitters (RecursiveCharacterTextSplitter)
├── LlamaIndex NodeParsers (SentenceWindowNodeParser)
├── Semantic chunking (sentence-transformers based)
├── NLTK (sentence tokenization)
├── SpaCy (NLP pipeline, sentence boundaries)
└── chonkie (fast chunking library)
Orchestration Frameworks
High-Level:
├── LangChain – most popular, broad ecosystem
├── LlamaIndex – best for document RAG, indexing strategies
├── Haystack (deepset) – production-focused
├── DSPy (Stanford) – programmatic LLM pipelines
├── AutoGen (Microsoft) – multi-agent
└── CrewAI – role-based multi-agent
Low-Level (more control):
├── Direct API calls (OpenAI, Anthropic, Together)
├── HuggingFace Transformers + Datasets
├── Instructor (structured outputs)
└── Guidance (constrained generation)
Agentic RAG:
├── LangGraph (stateful agent graphs)
├── LlamaIndex Workflows
├── Phidata
└── Pydantic AI
LLM Access
API Providers:
├── OpenAI (GPT-4o, o1, o3)
├── Anthropic (Claude 3.5 Sonnet, Haiku)
├── Google (Gemini 1.5 Pro, Flash)
├── Cohere (Command-R+, Rerank)
├── Mistral AI (Mistral Large, Codestral)
├── Together AI (open models)
├── Fireworks AI (fast inference)
├── Groq (ultra-fast LPU inference)
└── Perplexity AI
Self-Hosted:
├── Ollama (easiest local setup)
├── vLLM (production, high throughput)
├── llama.cpp (CPU friendly)
├── LM Studio (GUI for local models)
└── Jan.ai (desktop app)
Backend & APIs
Web Frameworks:
├── FastAPI (recommended, async, auto-docs)
├── Flask (simpler)
├── Django (full-stack)
└── Starlette (low-level async)
Databases:
├── PostgreSQL + pgvector
├── SQLite (dev/embedded)
├── MongoDB (document store)
├── Redis (caching, session)
└── Elasticsearch (full-text)
Message Queues:
├── Celery + Redis/RabbitMQ
├── Apache Kafka (high volume)
└── Bull (Node.js, if polyglot)
Caching:
├── Redis (semantic cache)
├── GPTCache (LLM-specific caching)
└── CDN caching for static chunks
5. DESIGN & DEVELOPMENT PROCESS
5.1 Naive RAG – Scratch to Working System
Step 1: Document Ingestion Pipeline
# COMPLETE INGESTION PIPELINE
import os
from pathlib import Path
from typing import List, Dict, Any
import hashlib
class DocumentIngestionPipeline:
def __init__(self, chunk_size=512, chunk_overlap=50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
def load_documents(self, directory: str) -> List[Dict]:
"""Load all supported documents from directory"""
documents = []
supported = ['.pdf', '.txt', '.md', '.docx', '.html']
for path in Path(directory).rglob('*'):
if path.suffix in supported:
content = self.extract_text(path)
doc_id = hashlib.md5(str(path).encode()).hexdigest()
documents.append({
'id': doc_id,
'content': content,
'metadata': {
'source': str(path),
'filename': path.name,
'file_type': path.suffix,
'created_at': os.path.getctime(path)
}
})
return documents
def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
"""Split documents into overlapping chunks"""
chunks = []
for doc in documents:
text = doc['content']
words = text.split()
for i in range(0, len(words), self.chunk_size - self.chunk_overlap):
chunk_words = words[i:i + self.chunk_size]
chunk_text = ' '.join(chunk_words)
chunk_id = f"{doc['id']}_{i}"
chunks.append({
'id': chunk_id,
'text': chunk_text,
'metadata': {
**doc['metadata'],
'chunk_index': i // (self.chunk_size - self.chunk_overlap),
'char_start': len(' '.join(words[:i])),
}
})
return chunks
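The pipeline above calls a `self.extract_text()` helper that isn't shown. A minimal sketch of such a method (PyPDF2 for PDFs, naive reads for text-like files; swap in pdfplumber or Unstructured for messier formats):

    def extract_text(self, path: Path) -> str:
        """Minimal text extraction; real pipelines should use format-aware parsers."""
        if path.suffix == '.pdf':
            from PyPDF2 import PdfReader
            reader = PdfReader(str(path))
            return "\n".join(page.extract_text() or "" for page in reader.pages)
        # .txt / .md / .html / .docx are read naively here
        return path.read_text(encoding='utf-8', errors='ignore')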
Step 2: Embedding & Indexing
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle
class EmbeddingIndexer:
def __init__(self, model_name='BAAI/bge-large-en-v1.5'):
self.model = SentenceTransformer(model_name)
self.index = None
self.chunk_store = {}
self.dim = self.model.get_sentence_embedding_dimension()
def build_index(self, chunks: List[Dict]):
"""Build FAISS HNSW index from chunks"""
texts = [chunk['text'] for chunk in chunks]
# Encode in batches
embeddings = self.model.encode(
texts,
batch_size=32,
show_progress_bar=True,
normalize_embeddings=True # for cosine similarity
)
# Create HNSW index (best for dense retrieval)
self.index = faiss.IndexHNSWFlat(self.dim, 32) # M=32
self.index.hnsw.efConstruction = 200
self.index.add(embeddings.astype('float32'))
# Store chunks for retrieval
for i, chunk in enumerate(chunks):
self.chunk_store[i] = chunk
print(f"Indexed {len(chunks)} chunks")
def search(self, query: str, top_k: int = 5) -> List[Dict]:
"""Retrieve top-k relevant chunks"""
query_embedding = self.model.encode(
[query], normalize_embeddings=True
)
self.index.hnsw.efSearch = 50
distances, indices = self.index.search(
query_embedding.astype('float32'), top_k
)
results = []
for dist, idx in zip(distances[0], indices[0]):
if idx != -1:
chunk = self.chunk_store[idx].copy()
chunk['score'] = float(dist)
results.append(chunk)
return results
Step 3: Generation with Context
from openai import OpenAI
class RAGGenerator:
def __init__(self, model='gpt-4o-mini'):
self.client = OpenAI()
self.model = model
def generate(self, query: str, retrieved_chunks: List[Dict]) -> Dict:
"""Generate answer with retrieved context"""
# Build context string with citations
context_parts = []
for i, chunk in enumerate(retrieved_chunks, 1):
context_parts.append(
f"[Source {i}: {chunk['metadata']['filename']}]\n{chunk['text']}"
)
context = "\n\n---\n\n".join(context_parts)
system_prompt = """You are a precise, helpful assistant. Answer questions
based ONLY on the provided context. If the context doesn't contain
enough information, say "I don't have enough information to answer this."
Always cite which source you used (e.g., [Source 1])."""
user_message = f"""Context:
{context}
Question: {query}
Answer (with citations):"""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.1,
max_tokens=1000
)
return {
'answer': response.choices[0].message.content,
'sources': [c['metadata']['source'] for c in retrieved_chunks],
'usage': response.usage.dict()
}
# Full Pipeline
class NaiveRAG:
def __init__(self):
self.indexer = EmbeddingIndexer()
self.generator = RAGGenerator()
def ingest(self, directory: str):
pipeline = DocumentIngestionPipeline()
docs = pipeline.load_documents(directory)
chunks = pipeline.chunk_documents(docs)
self.indexer.build_index(chunks)
def query(self, question: str, top_k: int = 5) -> Dict:
chunks = self.indexer.search(question, top_k)
return self.generator.generate(question, chunks)
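Illustrative usage of the pipeline above; the directory path and question are placeholders:

rag = NaiveRAG()
rag.ingest("./docs")                      # parse, chunk, embed, and index everything in ./docs
result = rag.query("What does the refund policy say about late returns?")
print(result['answer'])
print(result['sources'])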
5.2 Advanced RAG – Production System
Advanced Chunking with Semantic Splitting
from sklearn.metrics.pairwise import cosine_similarity
def semantic_chunking(text: str, model, threshold: float = 0.5) -> List[str]:
"""Split text where semantic similarity drops significantly"""
sentences = split_into_sentences(text)
embeddings = model.encode(sentences)
chunks = []
current_chunk = [sentences[0]]
for i in range(1, len(sentences)):
# Compare current sentence to previous
sim = cosine_similarity(
embeddings[i-1:i], embeddings[i:i+1]
)[0][0]
if sim < threshold: # Semantic boundary detected
chunks.append(' '.join(current_chunk))
current_chunk = [sentences[i]]
else:
current_chunk.append(sentences[i])
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
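The function above assumes a `split_into_sentences()` helper; a minimal NLTK-based sketch (newer NLTK releases may also need the `punkt_tab` resource):

import nltk

def split_into_sentences(text: str) -> list[str]:
    nltk.download('punkt', quiet=True)  # one-time download of the sentence tokenizer
    return nltk.sent_tokenize(text)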
HyDE (Hypothetical Document Embeddings)
def hyde_retrieval(query: str, llm_client, embedder, index, chunk_store) -> List[Dict]:
"""Generate hypothetical answer to improve retrieval"""
# Generate hypothetical document
hypo_prompt = f"""Write a short, factual paragraph that would directly
answer this question: {query}
Write as if you know the answer. Be specific."""
response = llm_client.chat.completions.create(
model='gpt-4o-mini',
messages=[{"role": "user", "content": hypo_prompt}],
max_tokens=200
)
hypothetical_doc = response.choices[0].message.content
# Encode hypothetical doc (instead of raw query)
hypo_embedding = embedder.encode([hypothetical_doc], normalize_embeddings=True)
# Retrieve using hypothetical embedding
distances, indices = index.search(hypo_embedding.astype('float32'), 5)
return [chunk_store[idx] for idx in indices[0] if idx != -1]
Multi-Query Retrieval
def multi_query_retrieval(query: str, llm_client, retriever) -> List[Dict]:
"""Generate multiple query variants for diverse retrieval"""
prompt = f"""Generate 4 different search queries to find information about:
"{query}"
Return ONLY the queries, one per line, no numbering."""
response = llm_client.chat.completions.create(
model='gpt-4o-mini',
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
queries = [query] + response.choices[0].message.content.strip().split('\n')
# Retrieve for each query
all_chunks = {}
for q in queries:
results = retriever.search(q, top_k=3)
for chunk in results:
# Deduplicate by chunk ID
all_chunks[chunk['id']] = chunk
    # Sort by best score (assumes higher = more similar; sort ascending if scores are distances)
    return sorted(all_chunks.values(), key=lambda x: x['score'], reverse=True)[:5]
Reranking Pipeline
from sentence_transformers import CrossEncoder
def rerank_with_cross_encoder(
query: str,
chunks: List[Dict],
model_name='BAAI/bge-reranker-large'
) -> List[Dict]:
"""Rerank retrieved chunks using cross-encoder"""
reranker = CrossEncoder(model_name)
# Create (query, passage) pairs
pairs = [(query, chunk['text']) for chunk in chunks]
# Score all pairs
scores = reranker.predict(pairs)
# Sort by reranker score
ranked = sorted(
zip(scores, chunks),
key=lambda x: x[0],
reverse=True
)
for score, chunk in ranked:
chunk['rerank_score'] = float(score)
return [chunk for _, chunk in ranked]
5.3 Full Production RAG Architecture (FastAPI)
# main.py – Production RAG Service
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
import uvicorn
import asyncio
from contextlib import asynccontextmanager
# Models
class QueryRequest(BaseModel):
question: str
top_k: int = 5
use_reranking: bool = True
use_hyde: bool = False
conversation_id: Optional[str] = None
class QueryResponse(BaseModel):
answer: str
sources: list[str]
retrieval_time_ms: float
generation_time_ms: float
chunks_retrieved: int
class IngestRequest(BaseModel):
source_url: Optional[str] = None
content: Optional[str] = None
metadata: Optional[dict] = None
# Global components
rag_components = {}
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
rag_components['retriever'] = HybridRetriever()
rag_components['reranker'] = CrossEncoderReranker()
rag_components['generator'] = RAGGenerator()
print("RAG service ready!")
yield
# Shutdown cleanup
app = FastAPI(
title="RAG Service API",
description="Production RAG service with hybrid retrieval",
lifespan=lifespan
)
app.add_middleware(CORSMiddleware, allow_origins=["*"])
@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest):
import time
# Retrieval
t0 = time.time()
retriever = rag_components['retriever']
chunks = await retriever.aretrieve(request.question, request.top_k * 2)
retrieval_ms = (time.time() - t0) * 1000
# Reranking
if request.use_reranking:
chunks = rag_components['reranker'].rerank(request.question, chunks)
chunks = chunks[:request.top_k]
# Generation
t1 = time.time()
result = await rag_components['generator'].agenerate(
request.question, chunks
)
gen_ms = (time.time() - t1) * 1000
return QueryResponse(
answer=result['answer'],
sources=result['sources'],
retrieval_time_ms=retrieval_ms,
generation_time_ms=gen_ms,
chunks_retrieved=len(chunks)
)
@app.post("/ingest")
async def ingest_endpoint(request: IngestRequest, background_tasks: BackgroundTasks):
background_tasks.add_task(
rag_components['retriever'].ingest_async,
request.content,
request.metadata
)
return {"status": "ingestion_queued"}
@app.get("/health")
async def health():
return {"status": "healthy", "chunks_indexed": rag_components['retriever'].count()}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
6. WORKING PRINCIPLES, ARCHITECTURES & HARDWARE
6.1 RAG Architecture Variants
A. Naive RAG (2020, Lewis et al.)
Query → Embed → FAISS Search → Top-K Chunks → LLM → Answer
Pros: Simple, fast, works out of the box
Cons: Retrieval quality limits answer quality
B. Advanced RAG (2023+)
Query → [Rewrite/Expand] → [Hybrid Search] → [Rerank] → [Filtered Context] → LLM
Adds: HyDE, multi-query, re-ranking, context compression
C. Modular RAG (2023+)
Configurable modules:
├── Search: web search, database, vector, knowledge graph
├── Memory: short-term (context), long-term (vector store)
├── Fusion: merge results from multiple sources
├── Routing: decide which retriever to use
├── Generator: choose LLM, prompt template
└── Predict: generate structured outputs
D. Agentic RAG (2024+)
Query → [LLM Agent]
            ↓
     [Tool Selection]
     ├── Vector Search Tool
     ├── Web Search Tool
     ├── SQL Query Tool
     ├── Calculator Tool
     └── Code Execution Tool
            ↓
  [Multi-step Reasoning]
            ↓
      [Final Answer]
E. Graph RAG (Microsoft, 2024)
Documents → [Entity Extraction] → [Knowledge Graph]
Query → [Community Detection] → [Graph Traversal] → [Summarization] → Answer
Excellent for: complex, interconnected domains
Tools: Microsoft GraphRAG, Neo4j + LangChain
F. RAPTOR (Tree RAG, 2024)
Chunks → [UMAP + GMM Clustering] → [Summarize Cluster] → higher-level nodes
→ [Cluster summaries] → [Summarize again] → root node
Multi-level retrieval from leaf to root
Best for: long documents, hierarchical knowledge
G. Corrective RAG (CRAG, 2024)
Query → Retrieve → [Relevance Evaluator]
            ├── Relevant: use docs
            ├── Ambiguous: refine + web search
            └── Irrelevant: web search + filter
                      ↓
               Generate Answer
H. Self-RAG (2023)
Query → LLM decides: [Retrieve? Yes/No]
If Yes → Retrieve → LLM critiques: [IsRel? IsSup? IsUse?]
→ Generate with self-reflection tokens
→ [ISREL] [ISSUP] [ISUSE] special tokens
Best for: adaptive retrieval without always retrieving
6.2 Hardware Requirements
For Development & Prototyping
Minimum (API-based RAG):
├── CPU: Any modern 4-core CPU
├── RAM: 16GB
├── Storage: 50GB SSD
├── GPU: Not required (using OpenAI/Anthropic APIs)
├── Network: Stable broadband
└── Cost: ~$50-200/month (API costs)
Recommended Dev Setup:
├── CPU: Apple M2/M3 or AMD Ryzen 9
├── RAM: 32-64GB (for local models)
├── Storage: 500GB NVMe SSD
├── GPU: RTX 3090 (24GB VRAM) – run 13B models
└── OS: Linux (Ubuntu 22.04) or macOS
For Running Local LLMs (Self-Hosted)
Small Models (7B params):
├── GPU: RTX 3080 (10GB) – quantized (Q4)
├── RAM: 32GB system RAM
├── VRAM needed: ~6GB for Q4, ~14GB for FP16
└── Models: LLaMA 3.1 8B, Mistral 7B, Gemma 7B
Medium Models (13B-30B):
├── GPU: RTX 3090/4090 (24GB) for Q4
├── Multi-GPU: 2x RTX 3090 for FP16
├── VRAM: ~10GB (Q4), ~26GB (FP16)
└── Models: LLaMA 2 13B, Qwen 14B, Mistral 22B
Large Models (70B):
├── GPU: 4x A100 80GB or 2x H100 80GB
├── Or: 4x RTX 4090 (24GB each) with Q4 quantization
├── VRAM: ~40GB (Q4), ~140GB (FP16)
└── Models: LLaMA 3.1 70B, Qwen 72B
Frontier (405B+):
├── GPU: 8x H100 80GB (minimum)
├── VRAM: ~240GB (Q4), ~810GB (BF16)
└── Models: LLaMA 3.1 405B
For Production RAG Service (Cloud)
Small Scale (<1000 QPS):
├── Vector DB: Qdrant on 32GB RAM, 8 cores
├── LLM: vLLM on 1x A100 40GB
├── API Server: 4 cores, 16GB RAM
└── Estimated cost: $1,500-3,000/month
Medium Scale (1000-10,000 QPS):
├── Vector DB: Pinecone or Qdrant cluster (3 nodes)
├── LLM: vLLM on 2-4x A100 80GB
├── API: Auto-scaling ECS/K8s
├── Cache: Redis cluster
└── Estimated cost: $5,000-15,000/month
Large Scale (>10,000 QPS):
├── Vector DB: Milvus cluster or Pinecone enterprise
├── LLM: TensorRT-LLM on H100 cluster
├── CDN + Global load balancing
└── Estimated cost: $30,000+/month
GPU Comparison for LLM Inference
| GPU | VRAM | Bandwidth | FP16 TFLOPS | Best For |
|---|---|---|---|---|
| RTX 4090 | 24GB | 1008 GB/s | 82.6 | Dev, 7B-13B |
| A100 40GB | 40GB | 1555 GB/s | 312 | Production 13B-70B |
| A100 80GB | 80GB | 2039 GB/s | 312 | Production 70B |
| H100 SXM | 80GB | 3350 GB/s | 989 | Frontier models |
| H200 SXM | 141GB | 4800 GB/s | 989 | Largest models |
| MI300X | 192GB | 5300 GB/s | 1307 | AMD alternative |
| Apple M3 Max | 128GB unified | 400 GB/s | ~14 | Local dev, CPU+GPU |
7. ADVANCED RAG PATTERNS
7.1 Agentic RAG with LangGraph
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
class RAGState(TypedDict):
question: str
documents: List[str]
answer: str
generation_count: int
needs_web_search: bool
def grade_documents(state: RAGState) -> RAGState:
"""LLM grades each retrieved document for relevance"""
docs = state['documents']
question = state['question']
relevant_docs = []
for doc in docs:
grade_prompt = f"""Is this document relevant to the question?
Question: {question}
Document: {doc[:500]}
Answer with only: 'yes' or 'no'"""
grade = llm.invoke(grade_prompt).content.strip().lower()
if grade == 'yes':
relevant_docs.append(doc)
# If too few relevant docs, trigger web search
state['documents'] = relevant_docs
state['needs_web_search'] = len(relevant_docs) < 2
return state
# Build the graph
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("grade_docs", grade_documents)
workflow.add_node("web_search", web_search_tool)
workflow.add_node("generate", generate_answer)
workflow.add_node("check_hallucination", check_hallucination)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_docs")
workflow.add_conditional_edges(
"grade_docs",
lambda state: "web_search" if state['needs_web_search'] else "generate"
)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", "check_hallucination")
workflow.add_conditional_edges(
"check_hallucination",
lambda state: "generate" if state['generation_count'] < 3 else END
)
app = workflow.compile()
7.2 Conversational RAG with Memory
from collections import deque
class ConversationalRAG:
def __init__(self, retriever, generator, max_history=5):
self.retriever = retriever
self.generator = generator
self.conversation_history = deque(maxlen=max_history * 2)
def _build_contextualized_query(self, question: str) -> str:
"""Rewrite query using conversation history"""
if not self.conversation_history:
return question
history_str = "\n".join([
f"{'User' if i%2==0 else 'Assistant'}: {msg}"
for i, msg in enumerate(self.conversation_history)
])
prompt = f"""Given this conversation history:
{history_str}
Rewrite the follow-up question as a standalone question:
Follow-up: {question}
Standalone question:"""
return llm.invoke(prompt).content.strip()
def chat(self, user_message: str) -> str:
# Contextualize query
standalone_q = self._build_contextualized_query(user_message)
# Retrieve
chunks = self.retriever.search(standalone_q, top_k=4)
# Generate with history context
answer = self.generator.generate_with_history(
question=user_message,
chunks=chunks,
history=list(self.conversation_history)
)
# Update history
self.conversation_history.append(user_message)
self.conversation_history.append(answer)
return answer
7.3 Multi-Modal RAG
# Handle images, tables, charts alongside text
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel
from unstructured.partition.pdf import partition_pdf
from openai import OpenAI

class MultiModalRAG:
    def __init__(self):
        self.text_embedder = SentenceTransformer('BAAI/bge-large-en')
        self.image_embedder = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
        self.vision_llm = OpenAI()  # GPT-4V / GPT-4o vision
def ingest_pdf_with_images(self, pdf_path: str):
"""Extract text, tables, and images from PDF"""
# Use LlamaParse or Unstructured for extraction
elements = partition_pdf(
filename=pdf_path,
strategy='hi_res',
extract_images_in_pdf=True,
infer_table_structure=True
)
for element in elements:
if element.type == 'Table':
# Convert to markdown, embed as text
table_text = element.metadata.text_as_html
self.index_text(table_text, {'type': 'table'})
elif element.type == 'Image':
# Embed with CLIP, store base64
image_embedding = self.encode_image(element.metadata.image_path)
self.index_image(image_embedding, element.metadata)
else:
self.index_text(element.text, {'type': 'text'})
def query(self, question: str, image=None) -> str:
"""Query with optional image input"""
# Text retrieval
text_chunks = self.text_index.search(question, top_k=3)
# Image retrieval (if query relates to visual content)
if self.is_visual_query(question):
image_chunks = self.image_index.search(question, top_k=2)
# Compose multi-modal context for GPT-4V
messages = self.build_multimodal_prompt(
question, text_chunks, image_chunks
)
return self.vision_llm.chat.completions.create(
model='gpt-4o', messages=messages
).choices[0].message.content
7.4 Knowledge Graph RAG
# Neo4j + LLM for structured knowledge retrieval
from neo4j import GraphDatabase
import spacy
class KnowledgeGraphRAG:
def __init__(self, neo4j_uri, neo4j_auth):
self.driver = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
self.nlp = spacy.load('en_core_web_lg')
self.llm = OpenAI()
def ingest_to_graph(self, text: str):
"""Extract entities and relationships, store in Neo4j"""
doc = self.nlp(text)
with self.driver.session() as session:
# Create entities
for ent in doc.ents:
session.run(
"MERGE (e:Entity {name: $name, type: $type})",
name=ent.text, type=ent.label_
)
# Create relationships (simplified; use an LLM for better extraction)
for sent in doc.sents:
self.extract_and_store_relations(session, sent.text)
def cypher_query_from_nl(self, question: str) -> str:
"""Convert natural language to Cypher using LLM"""
prompt = f"""Convert this question to a Neo4j Cypher query.
Graph has: (Entity {{name, type}}) and [:RELATES_TO {{relation}}] edges.
Question: {question}
Cypher query:"""
return self.llm.chat.completions.create(
model='gpt-4o',
messages=[{"role": "user", "content": prompt}]
).choices[0].message.content.strip()
def query(self, question: str) -> str:
# Try structured graph query
cypher = self.cypher_query_from_nl(question)
with self.driver.session() as session:
graph_results = session.run(cypher).data()
# Also do vector retrieval
vector_results = self.vector_retriever.search(question)
# Combine both
combined_context = self.format_graph_results(graph_results) + \
self.format_vector_results(vector_results)
return self.generator.generate(question, combined_context)
8. BUILDING YOUR OWN RAG SERVICE
8.1 System Design – Complete Architecture
┌──────────────────────────────────────────────────────┐
│                     CLIENT LAYER                      │
│      Web App | Mobile | Slack Bot | API Clients       │
└──────────────────────────┬───────────────────────────┘
                           │ HTTPS
┌──────────────────────────▼───────────────────────────┐
│                      API GATEWAY                      │
│           (Kong / AWS API Gateway / Nginx)            │
│        Rate limiting, Auth (JWT/OAuth), Routing       │
└────────┬──────────────────┬──────────────────┬────────┘
         │                  │                  │
  ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
  │  Query API  │    │ Ingest API  │    │  Admin API  │
  │  (FastAPI)  │    │  (FastAPI)  │    │  (FastAPI)  │
  └──────┬──────┘    └──────┬──────┘    └─────────────┘
         │                  │
┌────────▼──────────────────▼──────────────────────────┐
│                RAG ORCHESTRATION LAYER                │
│     Query Preprocessing → Retrieval → Reranking →     │
│    Context Assembly → Generation → Post-processing    │
└────────┬─────────────────────────────────┬───────────┘
         │                                 │
┌────────▼──────────┐           ┌──────────▼────────────┐
│     RETRIEVAL     │           │      GENERATION       │
│  Qdrant/Milvus    │           │  vLLM (LLaMA/Mistral) │
│  (Vector Store)   │           │  or OpenAI/Anthropic  │
│                   │           │  API                  │
│  Elasticsearch    │           │  Prompt Templates     │
│  (BM25 Search)    │           │  Context Compression  │
│                   │           │  Response Streaming   │
│  Redis            │           └───────────────────────┘
│  (Semantic Cache) │
└───────────────────┘
┌──────────────────────────────────────────────────────┐
│                      DATA LAYER                       │
│     PostgreSQL (metadata)  |  S3/GCS (raw docs)       │
│   Redis (sessions/cache)   |  Neo4j (knowledge graph) │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│                  OBSERVABILITY LAYER                  │
│       Prometheus + Grafana | Langfuse | Sentry        │
│          OpenTelemetry | ELK Stack (logs)             │
└──────────────────────────────────────────────────────┘
8.2 Docker Compose for Full Stack
# docker-compose.yml
version: '3.8'
services:
rag-api:
build: ./api
ports: ["8000:8000"]
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- QDRANT_URL=http://qdrant:6333
- REDIS_URL=redis://redis:6379
- POSTGRES_URL=postgresql://user:pass@postgres:5432/ragdb
depends_on: [qdrant, redis, postgres]
qdrant:
image: qdrant/qdrant:latest
ports: ["6333:6333", "6334:6334"]
volumes: ["./qdrant_data:/qdrant/storage"]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
postgres:
image: pgvector/pgvector:pg16
ports: ["5432:5432"]
environment:
POSTGRES_DB: ragdb
POSTGRES_USER: user
POSTGRES_PASSWORD: password
volumes: ["./postgres_data:/var/lib/postgresql/data"]
nginx:
image: nginx:alpine
ports: ["80:80", "443:443"]
volumes: ["./nginx.conf:/etc/nginx/nginx.conf"]
depends_on: [rag-api]
langfuse:
image: langfuse/langfuse:latest
ports: ["3000:3000"]
environment:
- DATABASE_URL=postgresql://user:pass@postgres:5432/langfuse
8.3 Semantic Caching
import redis
import numpy as np
from sentence_transformers import SentenceTransformer
import json
class SemanticCache:
"""Cache responses for semantically similar queries"""
def __init__(self, threshold=0.95, ttl=3600):
self.redis = redis.Redis(host='localhost', port=6379, decode_responses=False)
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
self.threshold = threshold
self.ttl = ttl
def _get_cache_keys(self):
return [k.decode() for k in self.redis.keys("cache:*")]
def get(self, query: str) -> dict | None:
query_emb = self.embedder.encode([query])[0]
cache_keys = self._get_cache_keys()
for key in cache_keys:
cached = self.redis.get(key)
if not cached:
continue
data = json.loads(cached)
cached_emb = np.array(data['embedding'])
# Cosine similarity
sim = np.dot(query_emb, cached_emb) / (
np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
)
if sim >= self.threshold:
return data['response'] # Cache hit!
return None # Cache miss
def set(self, query: str, response: dict):
embedding = self.embedder.encode([query])[0].tolist()
key = f"cache:{hash(query)}"
self.redis.setex(
key,
self.ttl,
json.dumps({'embedding': embedding, 'response': response})
)
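Illustrative usage of the cache above; `answer_with_rag()` is a stand-in for whatever pipeline produces the response:

cache = SemanticCache(threshold=0.92, ttl=1800)

question = "What is our refund policy?"
response = cache.get(question)
if response is None:               # cache miss: run the expensive RAG pipeline
    response = answer_with_rag(question)
    cache.set(question, response)  # future near-duplicate questions hit the cache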
9. CUTTING-EDGE DEVELOPMENTS (2024-2025)
9.1 Long-Context vs RAG Debate
- Gemini 1.5 Pro: 1M+ token context window
- Claude 3.5: 200K context
- The reality: RAG still wins for large corpora (billions of tokens), cost efficiency, and dynamic updates
- Hybrid approach: RAG for retrieval, long context for reasoning over retrieved docs
- Lost-in-the-middle problem: LLMs struggle with middle-of-context info; RAG helps by limiting context
9.2 Late Chunking (2024)
- Embed full documents, then chunk embeddings (not text)
- Preserves full document context in each chunk embedding
- Jina AI approach: `jina-embeddings-v3`
- Better than traditional chunk-first-embed-second
9.3 Contextual Retrieval (Anthropic, 2024)
- Prepend context to each chunk before embedding
- Prompt: "Here is a document: {DOCUMENT}. Please give a short context for this chunk: {CHUNK_CONTENT}"
- Reduces retrieval failures by 35% (49% when combined with contextual BM25)
- Adding reranking on top: 67% reduction in failures
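A hedged sketch of the contextual-retrieval idea: ask an LLM to situate each chunk within its parent document, then embed the generated context together with the chunk (model name and prompt wording are illustrative):

from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Please give a short context to situate this chunk within the document:\n{chunk}"
    )
    context = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    ).choices[0].message.content
    return f"{context}\n\n{chunk}"  # embed this combined string instead of the bare chunk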
9.4 Speculative RAG (2024)
- Smaller model generates draft answer + reasoning
- Larger model verifies and refines
- 2-4x faster than single large model RAG
9.5 RAG Fusion & Adaptive RAG
- Multiple retrieval strategies fused with learned weights
- Adaptive: LLM decides retrieval strategy per query
- FLARE: retrieve only when generation uncertainty is high
- Self-RAG: generate, critique, and regenerate
9.6 Multimodal RAG (2024-2025)
- ColPali: PDF retrieval using vision encoder (no text extraction needed!)
- Embed PDF pages as images using PaliGemma
- Retrieve relevant pages, feed to multimodal LLM
- Video RAG: temporal grounding in video content
- Audio RAG: whisper transcription + speaker diarization
9.7 Structured Output & Tool-Augmented RAG
- LLM generates SQL/Cypher to query databases
- NL2SQL: Text-to-SQL for structured data
- Tool-augmented RAG: retrieval + calculation + code execution
- Instructor library for guaranteed JSON output
9.8 Embedding Innovations (2025)
- Matryoshka embeddings (MRL): single model, multiple dimensions
- Binary quantization with rescoring: 40x faster, 0.3% accuracy loss
- Int8 quantization: 2x faster, negligible accuracy loss
- Multi-vector embeddings: multiple vectors per document (ColBERT-style)
9.9 Open-Source RAG Stacks (2025)
- R2R (SciPhi): Production RAG framework with built-in analytics
- Verba (Weaviate): The Golden RAGtriever, a complete open-source RAG app
- RAGFlow: Deep document understanding RAG
- Cognita (Truefoundry): Modular RAG framework
- Kotaemon: Document QA with citations
- AnythingLLM: All-in-one self-hosted RAG desktop
9.10 RAG + Agents (2025 Trend)
- OpenAI Deep Research: Multi-step web RAG with reasoning
- Perplexity Sonar: Real-time RAG with citations
- You.com Research: Agent-based RAG pipeline
- Trend: RAG evolving into full agentic research systems
10. PROJECT IDEAS: BEGINNER TO ADVANCED
BEGINNER LEVEL (Week 1-4)
Project 1: Personal Document Chatbot
- Goal: Chat with your own PDFs/notes
- Tech: LangChain + OpenAI + ChromaDB + Streamlit
- Steps:
- Upload PDF via Streamlit UI
- Parse with PyPDF2
- Chunk with RecursiveCharacterTextSplitter
- Embed with OpenAI embeddings
- Store in ChromaDB (local)
- Query with conversational chain
- Skills Learned: Basic RAG pipeline, UI creation
- Time: 2-3 days
Project 2: FAQ Bot for a Website
- Goal: Answer questions from a website's content
- Tech: Scrapy + Sentence-Transformers + FAISS + FastAPI
- Steps:
- Scrape website content
- Clean and chunk HTML
- Embed with MiniLM
- Build FAISS index
- Create FastAPI endpoint
- Return top-3 answers with sources
- Skills Learned: Web scraping, REST API, FAISS
- Time: 3-5 days
Project 3: Local AI Assistant (Fully Offline)
- Goal: RAG system with no API costs
- Tech: Ollama (LLaMA 3.1 8B) + ChromaDB + nomic-embed-text
- Steps:
- Install Ollama, pull LLaMA 3.1 8B
- Use Ollama for embeddings (nomic-embed-text)
- Build local ChromaDB index
- Chat interface with Gradio
- Skills Learned: Local LLMs, privacy-first RAG
- Time: 1-2 days
INTERMEDIATE LEVEL (Week 5-12)
Project 4: Advanced Legal/Medical Document RAG
- Goal: High-accuracy domain-specific RAG with citations
- Tech: LlamaIndex + Qdrant + BGE Reranker + GPT-4
- Features:
- Semantic chunking for legal documents
- Hybrid BM25 + dense retrieval
- Cross-encoder reranking
- Page-level citations
- Confidence scores
- "I don't know" detection
- Skills Learned: Domain RAG, hybrid retrieval, citations
- Time: 1-2 weeks
Project 5: Multi-Document Research Assistant
- Goal: Compare and synthesize across 100+ documents
- Tech: LangGraph + HyDE + Multi-Query + Cohere Rerank
- Features:
- Upload multiple documents
- Cross-document synthesis
- Contradiction detection
- Source attribution matrix
- Export to report
- Skills Learned: Agentic RAG, complex synthesis
- Time: 2 weeks
Project 6: Conversational RAG with Memory
- Goal: Chat that remembers past conversations
- Tech: LangChain + PostgreSQL + pgvector + Redis
- Features:
- User sessions and history
- Query contextualization
- Long-term memory storage in pg
- Short-term session cache in Redis
- Personal knowledge base per user
- Skills Learned: Conversational AI, session management
- Time: 2 weeks
Project 7: Code Documentation RAG
- Goal: Chat with a large codebase
- Tech: Tree-sitter + BGE + Qdrant + Claude
- Features:
- Parse code into semantic chunks (function/class level)
- Include docstrings and comments
- Dependency graph extraction
- "How does X work?" β returns relevant code + explanation
- Skills Learned: Code understanding, AST parsing
- Time: 1-2 weeks
ADVANCED LEVEL (Week 13-24)
Project 8: Production RAG SaaS
- Goal: Multi-tenant RAG service with billing
- Tech: FastAPI + Qdrant + vLLM + Stripe + Auth0 + K8s
- Features:
- Multi-tenant isolation (namespace per user)
- Rate limiting and quota management
- Usage-based billing with Stripe
- Admin dashboard
- Webhook for document events
- SLA monitoring
- Auto-scaling based on load
- Skills Learned: SaaS architecture, multitenancy, DevOps
- Time: 4-6 weeks
Project 9: Real-Time RAG with Web Search
- Goal: Answer questions with live web data
- Tech: Tavily/SerpAPI + LangGraph + GPT-4 + Streaming
- Features:
- Combine internal docs with web search
- CRAG pattern (evaluate, search web if needed)
- Streaming responses (SSE)
- Source freshness scoring
- Fact verification step
- Skills Learned: Agentic RAG, streaming, web augmentation
- Time: 3 weeks
Project 10: GraphRAG Knowledge System
- Goal: Interconnected knowledge with graph traversal
- Tech: Neo4j + SpaCy + LangChain + GPT-4
- Features:
- Entity + relationship extraction
- Community detection for summarization
- Graph-vector hybrid retrieval
- Relationship-aware answers
- Knowledge graph visualization
- Skills Learned: Knowledge graphs, NLP, graph databases
- Time: 4-6 weeks
Project 11: Fine-Tuned Embedding + RAG Pipeline
- Goal: Custom embedding model for your domain
- Tech: Sentence-Transformers + MTEB + Qdrant
- Steps:
- Collect domain Q&A pairs (1000+ examples)
- Fine-tune MiniLM with MNRL loss
- Evaluate on BEIR
- Deploy fine-tuned model
- Compare vs generic embeddings
- Skills Learned: Embedding fine-tuning, MTEB evaluation
- Time: 3 weeks
Project 12: MultiModal RAG with ColPali
- Goal: Search PDFs using visual understanding (no OCR!)
- Tech: ColPali + PaliGemma + GPT-4V + Qdrant
- Features:
- Index PDF pages as images
- Visual similarity search
- Answer questions about charts/tables/diagrams
- No text extraction required
- Skills Learned: Vision models, multimodal search
- Time: 3-4 weeks
11. REVERSE ENGINEERING EXISTING SYSTEMS
11.1 How to Reverse Engineer RAG Products
Step 1: Black-Box Testing
- Send queries and observe:
- Response latency (retrieval time hint)
- Citation format (chunk size hint)
- "I don't know" behavior
- Max context length behavior
- Streaming vs batch response
- Error messages (reveal stack)
Step 2: Analyze Behavior Patterns
Perplexity.ai analysis:
- Always cites web sources → live web search
- Fast response → parallel retrieval + small reranker
- Shows source snippets → 200-500 token chunks
- Sometimes "searches for" → agentic step visible
- Exact quote matching → BM25 + dense hybrid
ChatGPT with file upload:
- 512-1024 token chunks (context window visible)
- Summarizes long docs → retrieval + summarization
- Loses information in large files → fixed context budget
Step 3: Reconstruct Architecture
# Reconstruct Perplexity-like system:
import asyncio

class PerplexityClone:
    async def query(self, question: str) -> str:
        # 1. Classify query intent
        intent = self.classify(question)  # factual, conversational, code
        # 2. Generate search queries
        queries = self.generate_search_queries(question, n=3)
        # 3. Parallel web search (await so the searches actually run concurrently)
        results = await asyncio.gather(*[
            self.web_search(q) for q in queries
        ])
        # 4. Parse and chunk results
        chunks = self.parse_search_results(results)
        # 5. Rerank
        ranked = self.reranker.rerank(question, chunks)
        # 6. Generate with citations
        return self.generate_with_citations(question, ranked[:5])
11.2 Reverse Engineering Specific Systems
Notion AI
Observations:
- Context-aware (knows current page)
- Generates in Notion format (markdown blocks)
- Personal workspace knowledge
Likely Architecture:
- Workspace indexed per user in tenant-isolated vector store
- Block-level chunking (Notion's atomic units)
- Metadata filtering by workspace/page/user
- Fine-tuned generation for Notion markdown output
GitHub Copilot
Observations:
- Uses surrounding code context
- Repository-wide understanding
- Language-specific knowledge
Likely Architecture:
- File-level and function-level chunking by AST
- BM25 on identifiers + dense on semantics
- Sliding window context of open files
- Fill-in-the-middle (FIM) trained model
- Repository-level RAG for cross-file context
12. PRODUCTION DEPLOYMENT & MLOPS
12.1 RAG Pipeline Testing
# Unit testing RAG components
import pytest
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision
)
from datasets import Dataset
class TestRAGPipeline:
def test_retrieval_recall(self):
"""Ensure retrieval finds known-relevant docs"""
test_cases = [
{
"query": "What is the refund policy?",
"expected_doc_id": "policy_doc_001"
}
]
for case in test_cases:
results = retriever.search(case['query'], top_k=5)
ids = [r['id'] for r in results]
assert case['expected_doc_id'] in ids
def test_no_hallucination(self):
"""Answers must be grounded in context"""
question = "What is the capital of France?"
context = ["France is a country in Western Europe."] # No capital mentioned
answer = generator.generate(question, context)
# Should say "not in context" not "Paris"
assert "not" in answer.lower() or "don't" in answer.lower()
def test_ragas_metrics(self):
"""Run RAGAS evaluation on test set"""
data = Dataset.from_dict({
"question": test_questions,
"answer": generated_answers,
"contexts": retrieved_contexts,
"ground_truth": ground_truth_answers
})
results = evaluate(data, metrics=[
faithfulness, answer_relevancy,
context_recall, context_precision
])
assert results['faithfulness'] > 0.85
assert results['context_precision'] > 0.75
12.2 CI/CD Pipeline for RAG
# .github/workflows/rag-pipeline.yml
name: RAG Pipeline CI
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with: {python-version: '3.11'}
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run unit tests
run: pytest tests/unit
- name: Run integration tests
run: pytest tests/integration
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Evaluate RAG quality
run: python scripts/evaluate_rag.py --threshold 0.8
- name: Build Docker image
run: docker build -t rag-service:${{ github.sha }} .
- name: Deploy to staging
if: github.ref == 'refs/heads/main'
run: ./scripts/deploy.sh staging
12.3 Monitoring & Alerting
# Prometheus metrics for RAG service
from prometheus_client import Counter, Histogram, Gauge
rag_requests_total = Counter(
'rag_requests_total',
'Total RAG requests',
['status', 'route']
)
rag_latency_seconds = Histogram(
'rag_latency_seconds',
'RAG request latency',
['stage'], # retrieval, reranking, generation, total
buckets=[0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0]
)
retrieved_chunks_gauge = Gauge(
'retrieved_chunks_count',
'Number of chunks retrieved per request'
)
hallucination_rate = Counter(
'rag_hallucinations_detected',
'Responses flagged as hallucinations'
)
# Alert rules (Grafana/AlertManager)
ALERT_RULES = {
"high_latency": "p99 > 5s for 5min",
"low_faithfulness": "faithfulness < 0.7 for 10min",
"high_error_rate": "errors > 5% for 2min",
"vector_db_down": "qdrant_health = 0 for 1min"
}
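Illustrative usage of the metrics above inside a request handler; `retriever` and `question` are placeholders for your own objects:

import time

t0 = time.time()
chunks = retriever.search(question, top_k=5)                       # your retrieval call
rag_latency_seconds.labels(stage='retrieval').observe(time.time() - t0)
retrieved_chunks_gauge.set(len(chunks))
rag_requests_total.labels(status='success', route='/query').inc()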
13. RESEARCH PAPERS & RESOURCES
13.1 Foundational Papers
- Lewis et al. (2020) – "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
- The original RAG paper (Facebook AI)
- Karpukhin et al. (2020) – "Dense Passage Retrieval for Open-Domain Question Answering" (DPR)
- Izacard & Grave (2021) – "Leveraging Passage Retrieval with Generative Models for Open Domain QA" (FiD)
- Khattab & Zaharia (2020) – "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction"
- Robertson & Zaragoza (2009) – "The Probabilistic Relevance Framework: BM25 and Beyond"
- Malkov & Yashunin (2016) – "Efficient and Robust Approximate Nearest Neighbor Search Using HNSW"
- Gao et al. (2022) – "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE)
13.2 Advanced Papers (2023-2025)
- Asai et al. (2023) – "Self-RAG: Learning to Retrieve, Generate, and Critique"
- Shi et al. (2023) – "REPLUG: Retrieval-Augmented Black-Box Language Models"
- Edge et al. (2024) – "From Local to Global: A GraphRAG Approach" (Microsoft)
- Sarthi et al. (2024) – "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"
- Yan et al. (2024) – "Corrective Retrieval Augmented Generation" (CRAG)
- Faysse et al. (2024) – "ColPali: Efficient Document Retrieval with Vision Language Models"
- Anthropic (2024) – "Introducing Contextual Retrieval" (engineering blog)
- Zhao et al. (2024) – "Retrieval-Augmented Generation for AI-Generated Content: A Survey"
13.3 Learning Resources
Courses:
- DeepLearning.AI: "Building and Evaluating Advanced RAG" (free)
- DeepLearning.AI: "LangChain for LLM Application Development"
- DeepLearning.AI: "Vector Databases: from Embeddings to Applications"
- fast.ai: "Practical Deep Learning for Coders"
- Hugging Face: NLP Course (free, comprehensive)
Books:
- "Hands-On Large Language Models" β Jay Alammar & Maarten Grootendorst (2024)
- "Building LLM Apps" β Valentina Alto (2024)
- "The NLP Practitioner's Handbook" (multiple authors)
- "Designing Machine Learning Systems" β Chip Huyen
YouTube Channels:
- Andrej Karpathy – deep model understanding
- AI Explained – RAG and LLM news
- Sam Witteveen – practical LLM tutorials
- James Briggs – RAG tutorials
Communities:
- r/LocalLLaMA (Reddit) – self-hosted focus
- Hugging Face Discord – model discussions
- LangChain Discord – framework help
- LlamaIndex Discord – RAG specific
Benchmarks:
- MTEB – embedding model benchmark
- BEIR – IR benchmark
- RAGAS – RAG evaluation
- LLM-as-Judge benchmarks
- LMSYS Chatbot Arena
Quick-Start Checklist
Week 1-2: Get Your First RAG Working
- [ ] Install: `pip install langchain openai chromadb sentence-transformers`
- [ ] Get OpenAI API key
- [ ] Run basic RAG on 3 PDF files
- [ ] Understand the core pipeline: chunk → embed → retrieve → generate
Week 3-4: Level Up Retrieval
- [ ] Implement BM25 with `rank_bm25`
- [ ] Try `BAAI/bge-large-en-v1.5` embedding model
- [ ] Set up Qdrant locally with Docker
- [ ] Add cross-encoder reranking
Month 2: Advanced Patterns
- [ ] Implement HyDE
- [ ] Add multi-query retrieval
- [ ] Build conversational RAG with history
- [ ] Evaluate with RAGAS
Month 3: Production Ready
- [ ] FastAPI service with proper error handling
- [ ] Semantic cache with Redis
- [ ] Langfuse tracing
- [ ] Docker Compose deployment
- [ ] CI/CD pipeline
Month 4-6: Own the Stack
- [ ] Fine-tune embeddings on domain data
- [ ] Self-host LLM with vLLM
- [ ] Build agentic RAG with LangGraph
- [ ] Deploy to Kubernetes
- [ ] Monitor with Prometheus + Grafana