🧠 COMPLETE RAG (Retrieval-Augmented Generation) ROADMAP

From Zero to Production - Build Your Own Model & Services

Version: 2025.Q1 | Last Updated: March 2026 | Purpose: Educational and Professional Development

Introduction

What is RAG?

RAG (Retrieval-Augmented Generation) is an AI architecture that enhances Large Language Models (LLMs) by connecting them to external knowledge bases at inference time. Instead of relying solely on parametric memory (what's baked into model weights), RAG retrieves relevant documents and feeds them as context - producing grounded, accurate, up-to-date answers.

1. FOUNDATION & PREREQUISITES

1.1 Mathematics Foundations

  • Linear Algebra
    • Vectors, matrices, tensors
    • Dot products and cosine similarity (critical for retrieval)
    • Matrix multiplication (used in attention mechanisms)
    • Eigenvalues, SVD (Singular Value Decomposition - used in LSA)
    • Vector spaces and subspaces
  • Probability & Statistics
    • Probability distributions (Gaussian, Bernoulli, Categorical)
    • Bayes' Theorem (foundational for probabilistic retrieval)
    • Entropy, KL Divergence, Cross-Entropy (loss functions)
    • Maximum Likelihood Estimation
    • Expectation-Maximization (EM algorithm)
  • Calculus
    • Derivatives and gradients
    • Chain rule (backpropagation)
    • Gradient descent and its variants
    • Partial derivatives and Jacobians
  • Information Theory
    • Shannon entropy
    • Mutual information
    • TF-IDF derivation (term frequency-inverse document frequency)
    • Information gain

1.2 Programming Prerequisites

  • Python (Primary language)
    • Object-oriented programming
    • Async/await, coroutines
    • Type hints and dataclasses
    • Context managers and decorators
    • Generator functions (for streaming)
  • Data Structures & Algorithms
    • Hash maps, trees, heaps
    • k-d trees and ball trees (for ANN search)
    • Graph algorithms (for Knowledge Graphs)
    • Priority queues
  • Software Engineering
    • REST API design (FastAPI, Flask)
    • Microservices architecture
    • Docker and containerization
    • Git version control
    • Unit testing and integration testing

1.3 Machine Learning Fundamentals

  • Supervised vs unsupervised learning
  • Neural networks: perceptrons, activation functions
  • Backpropagation and optimization
  • Overfitting, regularization, dropout
  • Tokenization and vocabulary
  • Word embeddings (Word2Vec, GloVe, FastText)
  • Evaluation metrics: Precision, Recall, F1, NDCG, MRR

1.4 Deep Learning Foundations

  • Recurrent Neural Networks (RNNs, LSTMs, GRUs)
  • Convolutional Neural Networks (CNNs for text)
  • Attention mechanisms (Bahdanau, Luong)
  • Encoder-Decoder architectures
  • Transfer learning and fine-tuning
  • PyTorch or TensorFlow fundamentals

2. CORE CONCEPTS & THEORY

2.1 The Problem RAG Solves

Problem             RAG Solution
LLM hallucination   Ground answers in retrieved facts
Knowledge cutoff    Connect to live/updated databases
Domain specificity  Index private documents
Explainability      Show source documents
Token limit         Retrieve only relevant chunks

2.2 RAG vs. Fine-Tuning vs. In-Context Learning

Fine-Tuning:
  ✅ Learns new behaviors and styles
  ❌ Expensive, static knowledge, hallucination risk
  Use when: Changing model's tone/behavior/reasoning style

RAG:
  ✅ Dynamic knowledge, citable, cheap updates
  ❌ Retrieval latency, chunk quality dependency
  Use when: Need current/domain-specific factual recall

In-Context Learning (Prompt Engineering):
  ✅ Zero training cost
  ❌ Context window limits, no persistence
  Use when: Simple tasks, short documents

Hybrid (RAG + Fine-Tuning):
  ✅ Best of both worlds
  Use when: Production systems requiring both accuracy and style

2.3 Core RAG Pipeline Components

[User Query]
     ↓
[Query Processing]  - cleaning, expansion, rewriting
     ↓
[Retriever]  ───────→ [Vector Store / Index]
     ↓                        ↑
[Re-Ranker]          [Document Ingestion Pipeline]
     ↓
[Context Assembly]
     ↓
[Generator (LLM)]
     ↓
[Post-Processing]
     ↓
[Response + Citations]

3. STRUCTURED LEARNING PATH

πŸ“ PHASE 0 β€” Orientation (Week 1-2)

  • Read: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
  • Understand the original Facebook RAG paper
  • Run a toy RAG demo with LangChain + OpenAI
  • Understand what a vector embedding is visually

πŸ“ PHASE 1 β€” Text Processing & Embeddings (Week 3-6)

1A. Text Preprocessing
  • Tokenization
    • Whitespace tokenization
    • BPE (Byte-Pair Encoding) - used in GPT
    • WordPiece - used in BERT
    • SentencePiece - used in T5, LLaMA
    • Unigram Language Model tokenization
    • Special tokens: [CLS], [SEP], [PAD], [UNK], [MASK]
  • Text Cleaning
    • HTML/Markdown stripping
    • Unicode normalization
    • Stopword removal (context-dependent)
    • Lemmatization vs stemming
    • Named Entity Recognition (NER) for metadata extraction
  • Document Chunking Strategies
    • Fixed-size chunking (naive, 256/512 tokens; token-counting sketch after this list)
    • Sentence-based chunking (NLTK, SpaCy)
    • Paragraph-based chunking
    • Semantic chunking (split on embedding similarity drops)
    • Recursive character text splitting (LangChain default)
    • Document-aware chunking (respect headings, tables)
    • Sliding window with overlap (e.g., 512 tokens, 50 token overlap)
    • Agentic chunking (use LLM to determine chunk boundaries)
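Since chunk sizes are usually budgeted in tokens rather than words or characters, it helps to count tokens directly. A minimal token-aware sliding-window splitter, sketched here with `tiktoken` (the 512/50 sizes are just the common defaults mentioned above):

import tiktoken  # pip install tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking measured in tokens, with a sliding-window overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail
    return chunks

print(len(chunk_by_tokens("some long document text ... " * 200)))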
1B. Embeddings
  • Sparse Embeddings (Traditional)
    • Bag of Words (BoW)
    • TF-IDF (Term Frequency–Inverse Document Frequency)
    • BM25 (Best Match 25) - still the gold standard for keyword search
    • BM25+ and BM25L variants
    • SPLADE (Sparse Lexical and Expansion model)
  • Dense Embeddings (Neural)
    • Word-level: Word2Vec (CBOW, Skip-gram), GloVe, FastText
    • Sentence-level: InferSent, Universal Sentence Encoder
    • Transformer-based: BERT, RoBERTa, ALBERT
    • Bi-encoder architecture (query and doc encoded separately)
    • Cross-encoder architecture (query + doc encoded together)
  • State-of-the-Art Embedding Models
    • `text-embedding-3-large` (OpenAI, 3072-dim)
    • `text-embedding-ada-002` (OpenAI, 1536-dim)
    • `all-MiniLM-L6-v2` (Sentence-Transformers, fast)
    • `BAAI/bge-large-en-v1.5` (BGE family, SOTA open-source)
    • `BAAI/bge-m3` (multilingual, multi-granularity)
    • `Cohere embed-v3` (float, int8, binary quantization)
    • `E5-mistral-7b-instruct` (LLM-based embeddings)
    • `NV-Embed-v2` (NVIDIA, highest MTEB scores as of 2024)
    • `voyage-3` (Voyage AI)
    • `nomic-embed-text-v1.5` (open, 8192 context)
    • `mxbai-embed-large-v1` (Mixedbread)
  • Embedding Evaluation
    • MTEB (Massive Text Embedding Benchmark)
    • BEIR (Benchmarking IR)
    • Tasks: classification, clustering, retrieval, STS, summarization
  • Fine-tuning Embeddings
    • Contrastive learning (SimCSE)
    • Triplet loss: anchor, positive, negative
    • In-batch negatives
    • Hard negative mining
    • MNRL (Multiple Negatives Ranking Loss; see the fine-tuning sketch after this list)
    • Matryoshka Representation Learning (MRL) - embeddings at multiple dimensions
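A minimal fine-tuning sketch using the classic Sentence-Transformers training loop with MultipleNegativesRankingLoss; the two training pairs are made-up placeholders for real domain (query, relevant passage) data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical (query, relevant passage) pairs; other items in the same
# batch act as in-batch negatives under MNRL
train_examples = [
    InputExample(texts=["what is rag?", "RAG augments LLMs with retrieved context..."]),
    InputExample(texts=["hnsw parameters", "HNSW exposes M, efConstruction and efSearch..."]),
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("minilm-domain-tuned")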

πŸ“ PHASE 2 β€” Vector Databases & Indexing (Week 7-10)

2A. Similarity Search Fundamentals
  • Distance Metrics
    • Cosine similarity: `cos(θ) = (A·B) / (||A|| ||B||)` (NumPy sketch after this list)
    • Euclidean (L2) distance
    • Manhattan (L1) distance
    • Dot product (inner product)
    • Hamming distance (binary vectors)
    • Jaccard similarity (set-based)
  • Exact vs Approximate Nearest Neighbor (ANN)
    • Exact search: brute force O(n·d) - only for <100k vectors
    • ANN: trade tiny accuracy loss for massive speed gains
    • Recall@K as evaluation metric
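A quick NumPy sketch of the metrics above; note that for L2-normalized vectors, cosine similarity and dot product coincide (which is why indexes often store normalized embeddings):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle-based similarity
dot = a @ b                                                # magnitude-sensitive
euclidean = np.linalg.norm(a - b)                          # L2 distance
manhattan = np.abs(a - b).sum()                            # L1 distance

# After normalization, cosine == dot product (what FAISS IndexFlatIP exploits)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(a_n @ b_n, cosine)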
2B. ANN Indexing Algorithms
  • Tree-based
    • KD-Tree (fails in high dimensions, >20D)
    • Ball Tree
    • Random Projection Trees (Annoy)
    • ANNOY (Spotify) - forest of random trees
  • Hash-based
    • Locality Sensitive Hashing (LSH)
    • MinHash LSH (for Jaccard)
    • SimHash (for cosine)
    • Multi-probe LSH
  • Graph-based (Best for High-Dim Dense Vectors)
    • NSW (Navigable Small World graphs)
    • HNSW (Hierarchical Navigable Small World) - de facto standard
      • Build: insert nodes, connect to nearest neighbors at each layer
      • Query: greedy search from top layer down
      • Parameters: M (connections per node), efConstruction, efSearch
    • DiskANN (Microsoft) - disk-based, billion-scale
    • VAMANA (DiskANN underlying algorithm)
    • NGT (Yahoo Japan)
  • Quantization-based
    • Product Quantization (PQ)
      • Split vector into sub-vectors, quantize each
      • 32x compression with ~95% recall
    • Scalar Quantization (SQ) - int8, int4 quantization
    • Binary Quantization (BQ) - 1-bit, 32x compression
    • Optimized Product Quantization (OPQ)
    • FAISS (Facebook AI Similarity Search)
      • IVF (Inverted File Index) + PQ: `IndexIVFPQ`
      • Flat index: `IndexFlatL2`, `IndexFlatIP`
      • GPU FAISS for billion-scale
2C. Vector Databases (Production-Grade)
Database        Type                 Best For                   Hosted
Pinecone        Managed              Production, ease of use    ✅ Cloud only
Weaviate        Open-source + Cloud  Hybrid search, modules     ✅/🔧 Both
Qdrant          Open-source + Cloud  Performance, filtering     ✅/🔧 Both
Milvus/Zilliz   Open-source + Cloud  Billion-scale              ✅/🔧 Both
Chroma          Open-source          Local dev, prototyping     🔧 Self-hosted
pgvector        PostgreSQL ext       Existing Postgres users    🔧 Self-hosted
Redis Vector    Redis extension      Low-latency, caching       ✅/🔧 Both
OpenSearch      Open-source          Full-text + vector hybrid  🔧 Self-hosted
Elasticsearch   Open-source          Enterprise search          ✅/🔧 Both
LanceDB         Embedded             Serverless, local          🔧 Both
Vespa           Open-source          Complex ranking, ML        🔧 Self-hosted
  • Metadata Filtering
    • Pre-filtering (filter then search)
    • Post-filtering (search then filter)
    • Filtered HNSW (Qdrant's approach)
    • Payload indexing
    • Composite filtering (AND, OR, NOT, range queries)
  • Hybrid Search
    • Combining dense (vector) + sparse (keyword) results
    • Reciprocal Rank Fusion (RRF): `score = Σ 1/(k + rank_i)` (sketch after this list)
    • Weighted sum fusion
    • Learned sparse models: SPLADE, SPLADEv2, uniCOIL
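A minimal Reciprocal Rank Fusion sketch fusing a BM25 ranking with a dense ranking; k=60 is the constant conventionally used, and the doc ids are made up:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; score = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: docs appearing high in BOTH lists (doc1, doc3) rise to the top
bm25_hits  = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))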

πŸ“ PHASE 3 β€” Retrieval Strategies (Week 11-14)

3A. Basic Retrieval
  • Single-stage dense retrieval
  • BM25 keyword retrieval
  • Hybrid: BM25 + Dense (most common production setup)
3B. Advanced Retrieval Techniques
  • Query Transformation
    • Query expansion (add synonyms, related terms)
    • HyDE (Hypothetical Document Embeddings)
      • Generate a hypothetical answer → embed it → retrieve similar docs
    • Multi-Query Retrieval (generate 3-5 query variants)
    • Step-back prompting (abstract to higher level)
    • Query decomposition (break complex query into sub-queries)
    • FLARE (Forward-Looking Active Retrieval)
  • Retrieval Modes
    • Naive RAG: retrieve top-k → concatenate → generate
    • Sentence Window Retrieval: embed sentences, return surrounding window
    • Auto-merging Retrieval (LlamaIndex): hierarchical chunks
    • Parent-Child Retrieval: embed small chunks, return parent
    • Recursive Retrieval: retrieve → generate → retrieve again
    • Iterative RAG: multi-hop retrieval for complex questions
  • Re-ranking (Critical for Precision)
    • Cross-encoders: encode query+doc together, expensive but accurate
      • `cross-encoder/ms-marco-MiniLM-L-6-v2`
      • `BAAI/bge-reranker-large`
      • Cohere Rerank API
      • `Jina Reranker`
    • ColBERT (Late Interaction)
      • Encode query and doc separately, token-level interaction
      • `MaxSim` operator for scoring
      • RAGatouille library for easy use
    • LLM-based Reranking
      • RankGPT: use LLM to rank passages
      • PairwiseRanker
      • Listwise ranking with LLMs
    • Learning-to-Rank (LTR)
      • Pointwise, pairwise, listwise approaches
      • LambdaMART, XGBoost LTR

πŸ“ PHASE 4 β€” Generation & LLMs (Week 15-20)

4A. Understanding LLMs
  • Transformer Architecture Deep Dive
    • Self-attention: `Attention(Q,K,V) = softmax(QK^T / √d_k)V`
    • Multi-head attention (MHA)
    • Grouped Query Attention (GQA) - used in LLaMA 3
    • Multi-Query Attention (MQA) - faster inference
    • Feed-forward layers (SwiGLU, GeGLU activations)
    • Positional encodings: sinusoidal, RoPE, ALiBi
    • Layer Normalization (Pre-LN vs Post-LN)
    • KV-Cache mechanism
  • Decoder-only Models (Generation)
    • GPT family (OpenAI): GPT-4o, GPT-4-turbo
    • LLaMA family (Meta): LLaMA 2, LLaMA 3, LLaMA 3.1
    • Mistral family: Mistral 7B, Mixtral 8x7B (MoE)
    • Gemma (Google): Gemma 2B, 7B, 27B
    • Phi family (Microsoft): Phi-3, Phi-3.5
    • Qwen (Alibaba): Qwen2, Qwen2.5
    • Command-R (Cohere) - RAG-optimized
    • DeepSeek-V2, V3 - MoE architecture
  • Encoder-Decoder Models
    • T5, Flan-T5
    • BART, mBART
    • Original RAG paper used BART as generator
4B. Inference Optimization
  • Quantization
    • GPTQ (Post-Training Quantization, weight-only)
    • AWQ (Activation-aware Weight Quantization)
    • GGUF (llama.cpp format) - Q4_K_M, Q5_K_M, Q8_0
    • bitsandbytes (8-bit, 4-bit via NF4)
    • HQQ (Half-Quadratic Quantization)
    • FP8 training and inference (H100 native)
  • Serving Frameworks
    • vLLM (PagedAttention, continuous batching, highest throughput)
    • llama.cpp (CPU inference, GGUF format)
    • Ollama (local LLM server, easy setup)
    • TGI - Text Generation Inference (HuggingFace)
    • TensorRT-LLM (NVIDIA, fastest on A100/H100)
    • LMDeploy (InternLM)
    • SGLang (structured generation, fast)
  • Context Length & Memory
    • FlashAttention (memory-efficient attention, 2-4x speedup)
    • FlashAttention-2, FlashAttention-3
    • Sliding window attention (Mistral)
    • Ring attention (distributed long context)
    • Paged KV-Cache (vLLM)
    • GQA/MQA for KV-cache reduction
    • Speculative Decoding (draft model speeds up large model)
4C. Prompt Engineering for RAG
  • System prompt design for RAG
  • Context window budget allocation
  • Citation/grounding instruction prompting
  • Chain-of-thought (CoT) for multi-hop
  • Few-shot RAG examples
  • Structured output prompting (JSON mode)
  • Handling "I don't know" responses
  • Confidence calibration prompting
4D. Fine-tuning for RAG
  • SFT (Supervised Fine-Tuning)
    • Format: `[System] [Retrieved Context] [Query] → [Answer]`
    • Datasets: NarrativeQA, QuALITY, SQuAD, HotpotQA
    • Tools: HuggingFace TRL, Axolotl, LLaMA-Factory
  • Parameter-Efficient Fine-Tuning (PEFT)
    • LoRA (Low-Rank Adaptation): `W = W₀ + BA` (B, A low-rank; PEFT sketch after this list)
    • QLoRA (4-bit quantized LoRA)
    • AdaLoRA (adaptive rank)
    • LoftQ (quantization-aware LoRA init)
    • IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
  • RLHF for RAG
    • PPO (Proximal Policy Optimization)
    • DPO (Direct Preference Optimization) - simpler, no reward model
    • RLAIF (RL from AI Feedback)
    • Constitutional AI (Anthropic)
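A minimal LoRA setup sketch with HuggingFace PEFT; the base model and hyperparameters are illustrative, not a recommendation:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # illustrative

config = LoraConfig(
    r=16,                                  # rank of the update matrices B and A
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically <1% of the base model's weights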

πŸ“ PHASE 5 β€” Evaluation & Observability (Week 21-24)

5A. RAG Evaluation Metrics
  • Retrieval Quality
    • Context Precision: relevant docs / retrieved docs
    • Context Recall: retrieved relevant docs / all relevant docs
    • MRR (Mean Reciprocal Rank): `MRR = (1/|Q|) Σ 1/rank_i` (computed in the sketch after this list)
    • NDCG@K (Normalized Discounted Cumulative Gain)
    • Hit Rate@K
  • Generation Quality
    • Faithfulness: is the answer grounded in context?
    • Answer Relevance: does the answer address the question?
    • Answer Correctness: factual accuracy vs ground truth
    • BLEU, ROUGE (reference-based, less useful for open-ended)
    • BERTScore (semantic similarity)
    • G-Eval (LLM-as-judge)
    • Ragas Score (RAG-specific composite metric)
  • End-to-End Metrics
    • Answer Similarity
    • Semantic Answer Similarity (SAS)
    • ARES (Automated RAG Evaluation System)
    • RAGAS (open-source RAG evaluation framework)
    • TruLens (evaluation + tracking)
    • DeepEval (unit testing for LLMs)
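A pure-Python sketch of two retrieval metrics from the list above, MRR and Hit Rate@K, over a made-up two-query evaluation set:

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for ranking, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(results)

def hit_rate_at_k(results: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant doc appears in the top-k results."""
    hits = sum(1 for ranking, gold in zip(results, relevant) if gold in ranking[:k])
    return hits / len(results)

retrieved = [["d2", "d1", "d9"], ["d4", "d5", "d7"]]   # per-query rankings (made up)
gold_docs = ["d1", "d7"]                               # one relevant doc per query
print(mrr(retrieved, gold_docs))                 # (1/2 + 1/3) / 2 ≈ 0.417
print(hit_rate_at_k(retrieved, gold_docs, k=3))  # 1.0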
5B. Observability & Monitoring
  • Tracing Tools
    • LangSmith (LangChain native)
    • Phoenix (Arize AI)
    • Langfuse (open-source)
    • W&B Weave (Weights & Biases)
    • Helicone
    • OpenTelemetry for custom tracing
  • Key Metrics to Monitor
    • Latency (P50, P90, P99) per pipeline stage
    • Token usage and cost
    • Retrieval success rate
    • Hallucination rate (using LLM judge)
    • User feedback signals (thumbs up/down)
    • Cache hit rate

4. ALGORITHMS, TECHNIQUES & TOOLS

4.1 Complete Algorithm Reference

Retrieval Algorithms
Sparse:
├── BM25 (Robertson & Zaragoza, 2009) - most used baseline
├── BM25+ / BM25L - improved variants
├── TF-IDF
├── SPLADE (Formal et al., 2021) - learned sparse
├── uniCOIL - sparse with BERT
└── DeepImpact - learned doc-side weights

Dense:
├── DPR (Dense Passage Retrieval) - Facebook, 2020
├── ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation)
├── E5 (EmbEddings from bidirEctional Encoder rEpresentations)
├── BGE (BAAI General Embeddings)
├── GTE (General Text Embeddings, Alibaba)
└── SimCSE (Simple Contrastive Sentence Embeddings)

Hybrid:
├── RRF (Reciprocal Rank Fusion)
├── Linear interpolation: score = α*dense + (1-α)*sparse
├── PLAID (ColBERT-based efficient retrieval)
└── Learned hybrid weights

Multi-hop:
├── MDR (Multi-hop Dense Retrieval)
├── Baleen (conditioned retrieval)
├── FLARE (Forward-Looking Active Retrieval Augmentation)
└── IRCoT (Interleaving Retrieval with Chain-of-Thought)
Reranking Algorithms
Cross-Encoders:
├── MonoBERT (point-wise)
├── MonoT5 (seq2seq reranker)
├── DuoBERT (pair-wise)
└── RankLLaMA

Late Interaction:
├── ColBERT (Khattab & Zaharia, 2020)
├── ColBERTv2 (residual compression)
├── PLAID (efficient ColBERT)
└── ColBERT-QA

LLM-based:
├── RankGPT (Sun et al., 2023)
├── PRP (Pairwise Ranking Prompting)
├── LRL (Listwise Reranker)
└── Setwise ranking
Generation Algorithms
Decoding Strategies (temperature and top-p are sketched after these trees):
├── Greedy decoding
├── Beam search (width B)
├── Top-K sampling
├── Top-P (nucleus) sampling
├── Temperature scaling
├── Repetition penalty
├── Contrastive decoding
└── Speculative decoding

RAG-specific:
├── Token-level RAG (RETRO-style)
├── Fusion-in-Decoder (FiD)
├── REALM (Retrieval-Augmented Language Model)
├── kNN-LM (k-nearest neighbors LM)
└── Adaptive Retrieval (decide when to retrieve)
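A NumPy sketch of two of the decoding strategies above, temperature scaling plus top-p (nucleus) sampling, applied to a toy four-token vocabulary:

import numpy as np

def sample_top_p(logits, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Temperature-scale logits, keep the smallest token set whose cumulative
    probability exceeds top_p, then sample from that nucleus."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp((logits - logits.max()) / temperature)  # stable softmax w/ temperature
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # size of the nucleus

    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

logits = [2.0, 1.5, 0.2, -1.0]  # toy next-token logits
print(sample_top_p(logits))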

4.2 Complete Tools Ecosystem

Data Processing
Parsing & Loading:
├── LlamaParse (advanced PDF parsing, tables, figures)
├── Unstructured.io (20+ file types)
├── PyPDF2, pdfplumber, pdfminer
├── Docling (IBM, multi-format)
├── Marker (PDF → Markdown, open-source)
├── Camelot, Tabula (table extraction)
├── Beautiful Soup, Scrapy (web scraping)
├── Pandoc (document conversion)
└── Apache Tika (enterprise parsing)

Chunking:
├── LangChain TextSplitters (RecursiveCharacterTextSplitter)
├── LlamaIndex NodeParsers (SentenceWindowNodeParser)
├── Semantic chunking (sentence-transformers based)
├── NLTK (sentence tokenization)
├── SpaCy (NLP pipeline, sentence boundaries)
└── chonkie (fast chunking library)
Orchestration Frameworks
High-Level:
├── LangChain - most popular, broad ecosystem
├── LlamaIndex - best for document RAG, indexing strategies
├── Haystack (deepset) - production-focused
├── DSPy (Stanford) - programmatic LLM pipelines
├── AutoGen (Microsoft) - multi-agent
└── CrewAI - role-based multi-agent

Low-Level (more control):
├── Direct API calls (OpenAI, Anthropic, Together)
├── HuggingFace Transformers + Datasets
├── Instructor (structured outputs)
└── Guidance (constrained generation)

Agentic RAG:
├── LangGraph (stateful agent graphs)
├── LlamaIndex Workflows
├── Phidata
└── Pydantic AI
LLM Access
API Providers:
├── OpenAI (GPT-4o, o1, o3)
├── Anthropic (Claude 3.5 Sonnet, Haiku)
├── Google (Gemini 1.5 Pro, Flash)
├── Cohere (Command-R+, Rerank)
├── Mistral AI (Mistral Large, Codestral)
├── Together AI (open models)
├── Fireworks AI (fast inference)
├── Groq (ultra-fast LPU inference)
└── Perplexity AI

Self-Hosted:
├── Ollama (easiest local setup)
├── vLLM (production, high throughput)
├── llama.cpp (CPU friendly)
├── LM Studio (GUI for local models)
└── Jan.ai (desktop app)
Backend & APIs
Web Frameworks:
├── FastAPI (recommended, async, auto-docs)
├── Flask (simpler)
├── Django (full-stack)
└── Starlette (low-level async)

Databases:
├── PostgreSQL + pgvector
├── SQLite (dev/embedded)
├── MongoDB (document store)
├── Redis (caching, session)
└── Elasticsearch (full-text)

Message Queues:
├── Celery + Redis/RabbitMQ
├── Apache Kafka (high volume)
└── Bull (Node.js, if polyglot)

Caching:
├── Redis (semantic cache)
├── GPTCache (LLM-specific caching)
└── CDN caching for static chunks

5. DESIGN & DEVELOPMENT PROCESS

5.1 Naive RAG - Scratch to Working System

Step 1: Document Ingestion Pipeline
# COMPLETE INGESTION PIPELINE

import os
from pathlib import Path
from typing import List, Dict, Any
import hashlib

class DocumentIngestionPipeline:
    def __init__(self, chunk_size=512, chunk_overlap=50):
        # Note: this simple splitter measures chunk_size in WORDS, not tokens
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def extract_text(self, path: Path) -> str:
        """Minimal extraction for plain-text formats; swap in pdfplumber,
        python-docx, or Unstructured for PDF/DOCX/HTML in a real pipeline."""
        return path.read_text(encoding='utf-8', errors='ignore')
    
    def load_documents(self, directory: str) -> List[Dict]:
        """Load all supported documents from directory"""
        documents = []
        supported = ['.pdf', '.txt', '.md', '.docx', '.html']
        
        for path in Path(directory).rglob('*'):
            if path.suffix in supported:
                content = self.extract_text(path)
                doc_id = hashlib.md5(str(path).encode()).hexdigest()
                documents.append({
                    'id': doc_id,
                    'content': content,
                    'metadata': {
                        'source': str(path),
                        'filename': path.name,
                        'file_type': path.suffix,
                        'created_at': os.path.getctime(path)
                    }
                })
        return documents
    
    def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
        """Split documents into overlapping chunks"""
        chunks = []
        for doc in documents:
            text = doc['content']
            words = text.split()
            
            for i in range(0, len(words), self.chunk_size - self.chunk_overlap):
                chunk_words = words[i:i + self.chunk_size]
                chunk_text = ' '.join(chunk_words)
                chunk_id = f"{doc['id']}_{i}"
                
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'metadata': {
                        **doc['metadata'],
                        'chunk_index': i // (self.chunk_size - self.chunk_overlap),
                        'char_start': len(' '.join(words[:i])),
                    }
                })
        return chunks
Step 2: Embedding & Indexing
from typing import Dict, List

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

class EmbeddingIndexer:
    def __init__(self, model_name='BAAI/bge-large-en-v1.5'):
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.chunk_store = {}
        self.dim = self.model.get_sentence_embedding_dimension()
    
    def build_index(self, chunks: List[Dict]):
        """Build FAISS HNSW index from chunks"""
        texts = [chunk['text'] for chunk in chunks]
        
        # Encode in batches
        embeddings = self.model.encode(
            texts, 
            batch_size=32, 
            show_progress_bar=True,
            normalize_embeddings=True  # for cosine similarity
        )
        
        # Create HNSW index (best for dense retrieval)
        self.index = faiss.IndexHNSWFlat(self.dim, 32)  # M=32
        self.index.hnsw.efConstruction = 200
        self.index.add(embeddings.astype('float32'))
        
        # Store chunks for retrieval
        for i, chunk in enumerate(chunks):
            self.chunk_store[i] = chunk
        
        print(f"Indexed {len(chunks)} chunks")
    
    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve top-k relevant chunks"""
        query_embedding = self.model.encode(
            [query], normalize_embeddings=True
        )
        
        self.index.hnsw.efSearch = 50
        distances, indices = self.index.search(
            query_embedding.astype('float32'), top_k
        )
        
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx != -1:
                chunk = self.chunk_store[idx].copy()
                # IndexHNSWFlat returns L2 distances: LOWER score = closer match
                chunk['score'] = float(dist)
                results.append(chunk)
        
        return results
Step 3: Generation with Context
from typing import Dict, List

from openai import OpenAI

class RAGGenerator:
    def __init__(self, model='gpt-4o-mini'):
        self.client = OpenAI()
        self.model = model
    
    def generate(self, query: str, retrieved_chunks: List[Dict]) -> Dict:
        """Generate answer with retrieved context"""
        
        # Build context string with citations
        context_parts = []
        for i, chunk in enumerate(retrieved_chunks, 1):
            context_parts.append(
                f"[Source {i}: {chunk['metadata']['filename']}]\n{chunk['text']}"
            )
        context = "\n\n---\n\n".join(context_parts)
        
        system_prompt = """You are a precise, helpful assistant. Answer questions 
        based ONLY on the provided context. If the context doesn't contain 
        enough information, say "I don't have enough information to answer this."
        Always cite which source you used (e.g., [Source 1])."""
        
        user_message = f"""Context:
{context}

Question: {query}

Answer (with citations):"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}
            ],
            temperature=0.1,
            max_tokens=1000
        )
        
        return {
            'answer': response.choices[0].message.content,
            'sources': [c['metadata']['source'] for c in retrieved_chunks],
            'usage': response.usage.model_dump()  # .dict() is deprecated in Pydantic v2
        }

# Full Pipeline
class NaiveRAG:
    def __init__(self):
        self.indexer = EmbeddingIndexer()
        self.generator = RAGGenerator()
    
    def ingest(self, directory: str):
        pipeline = DocumentIngestionPipeline()
        docs = pipeline.load_documents(directory)
        chunks = pipeline.chunk_documents(docs)
        self.indexer.build_index(chunks)
    
    def query(self, question: str, top_k: int = 5) -> Dict:
        chunks = self.indexer.search(question, top_k)
        return self.generator.generate(question, chunks)

5.2 Advanced RAG - Production System

Advanced Chunking with Semantic Splitting
import re
from typing import List

from sklearn.metrics.pairwise import cosine_similarity

def split_into_sentences(text: str) -> List[str]:
    """Naive regex splitter; use NLTK or SpaCy sentence tokenizers in production."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def semantic_chunking(text: str, model, threshold: float = 0.5) -> List[str]:
    """Split text where semantic similarity drops significantly"""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    
    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Compare current sentence to previous
        sim = cosine_similarity(
            embeddings[i-1:i], embeddings[i:i+1]
        )[0][0]
        
        if sim < threshold:  # Semantic boundary detected
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks
HyDE (Hypothetical Document Embeddings)
def hyde_retrieval(query: str, llm_client, embedder, index) -> List[Dict]:
    """Generate hypothetical answer to improve retrieval"""
    
    # Generate hypothetical document
    hypo_prompt = f"""Write a short, factual paragraph that would directly 
    answer this question: {query}
    Write as if you know the answer. Be specific."""
    
    response = llm_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": hypo_prompt}],
        max_tokens=200
    )
    hypothetical_doc = response.choices[0].message.content
    
    # Encode hypothetical doc (instead of raw query)
    hypo_embedding = embedder.encode([hypothetical_doc], normalize_embeddings=True)
    
    # Retrieve using hypothetical embedding
    distances, indices = index.search(hypo_embedding.astype('float32'), 5)
    
    # Assumes a module-level `chunk_store` mapping FAISS row ids to chunks (see Step 2)
    return [chunk_store[idx] for idx in indices[0] if idx != -1]
Multi-Query Retrieval
from typing import Dict, List

def multi_query_retrieval(query: str, llm_client, retriever) -> List[Dict]:
    """Generate multiple query variants for diverse retrieval"""
    
    prompt = f"""Generate 4 different search queries to find information about:
    "{query}"
    
    Return ONLY the queries, one per line, no numbering."""
    
    response = llm_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    
    queries = [query] + response.choices[0].message.content.strip().split('\n')
    
    # Retrieve for each query
    all_chunks = {}
    for q in queries:
        results = retriever.search(q, top_k=3)
        for chunk in results:
            # Deduplicate by chunk ID
            all_chunks[chunk['id']] = chunk
    
    # Keep the 5 best-scoring chunks (assumes higher score = more relevant;
    # sort ascending instead if your retriever returns raw L2 distances)
    return sorted(all_chunks.values(), key=lambda x: x['score'], reverse=True)[:5]
Reranking Pipeline
from typing import Dict, List

from sentence_transformers import CrossEncoder

def rerank_with_cross_encoder(
    query: str, 
    chunks: List[Dict], 
    model_name='BAAI/bge-reranker-large'
) -> List[Dict]:
    """Rerank retrieved chunks using cross-encoder"""
    
    reranker = CrossEncoder(model_name)  # cache this at module level in production
    
    # Create (query, passage) pairs
    pairs = [(query, chunk['text']) for chunk in chunks]
    
    # Score all pairs
    scores = reranker.predict(pairs)
    
    # Sort by reranker score
    ranked = sorted(
        zip(scores, chunks),
        key=lambda x: x[0],
        reverse=True
    )
    
    for score, chunk in ranked:
        chunk['rerank_score'] = float(score)
    
    return [chunk for _, chunk in ranked]

5.3 Full Production RAG Architecture (FastAPI)

# main.py - Production RAG Service
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
import time
import uvicorn
from contextlib import asynccontextmanager

# Models
class QueryRequest(BaseModel):
    question: str
    top_k: int = 5
    use_reranking: bool = True
    use_hyde: bool = False
    conversation_id: Optional[str] = None

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    retrieval_time_ms: float
    generation_time_ms: float
    chunks_retrieved: int

class IngestRequest(BaseModel):
    source_url: Optional[str] = None
    content: Optional[str] = None
    metadata: Optional[dict] = None

# Global components
rag_components = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: HybridRetriever, CrossEncoderReranker, and an async-capable
    # RAGGenerator are assumed to be implemented elsewhere in the service
    rag_components['retriever'] = HybridRetriever()
    rag_components['reranker'] = CrossEncoderReranker()
    rag_components['generator'] = RAGGenerator()
    print("RAG service ready!")
    yield
    # Shutdown cleanup

app = FastAPI(
    title="RAG Service API",
    description="Production RAG service with hybrid retrieval",
    lifespan=lifespan
)

app.add_middleware(CORSMiddleware, allow_origins=["*"])

@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest):
    # Retrieval
    t0 = time.time()
    retriever = rag_components['retriever']
    chunks = await retriever.aretrieve(request.question, request.top_k * 2)
    retrieval_ms = (time.time() - t0) * 1000
    
    # Reranking
    if request.use_reranking:
        chunks = rag_components['reranker'].rerank(request.question, chunks)
    
    chunks = chunks[:request.top_k]
    
    # Generation
    t1 = time.time()
    result = await rag_components['generator'].agenerate(
        request.question, chunks
    )
    gen_ms = (time.time() - t1) * 1000
    
    return QueryResponse(
        answer=result['answer'],
        sources=result['sources'],
        retrieval_time_ms=retrieval_ms,
        generation_time_ms=gen_ms,
        chunks_retrieved=len(chunks)
    )

@app.post("/ingest")
async def ingest_endpoint(request: IngestRequest, background_tasks: BackgroundTasks):
    background_tasks.add_task(
        rag_components['retriever'].ingest_async,
        request.content,
        request.metadata
    )
    return {"status": "ingestion_queued"}

@app.get("/health")
async def health():
    return {"status": "healthy", "chunks_indexed": rag_components['retriever'].count()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

6. WORKING PRINCIPLES, ARCHITECTURES & HARDWARE

6.1 RAG Architecture Variants

A. Naive RAG (2020 - Lewis et al.)
Query → Embed → FAISS Search → Top-K Chunks → LLM → Answer
Pros: Simple, fast, works out of the box
Cons: Retrieval quality limits answer quality
B. Advanced RAG (2023+)
Query → [Rewrite/Expand] → [Hybrid Search] → [Rerank] → [Filtered Context] → LLM
Adds: HyDE, multi-query, re-ranking, context compression
C. Modular RAG (2023+)
Configurable modules:
├── Search: web search, database, vector, knowledge graph
├── Memory: short-term (context), long-term (vector store)
├── Fusion: merge results from multiple sources
├── Routing: decide which retriever to use
├── Generator: choose LLM, prompt template
└── Predict: generate structured outputs
D. Agentic RAG (2024+)
Query → [LLM Agent]
              ↓
    [Tool Selection]
    ├── Vector Search Tool
    ├── Web Search Tool
    ├── SQL Query Tool
    ├── Calculator Tool
    └── Code Execution Tool
              ↓
    [Multi-step Reasoning]
              ↓
    [Final Answer]
E. Graph RAG (Microsoft, 2024)
Documents → [Entity Extraction] → [Knowledge Graph]
Query → [Community Detection] → [Graph Traversal] → [Summarization] → Answer

Excellent for: complex, interconnected domains
Tools: Microsoft GraphRAG, Neo4j + LangChain
F. RAPTOR (Tree RAG, 2024)
Chunks → [UMAP + GMM Clustering] → [Summarize Cluster] → Higher-level nodes
         → [Cluster summaries] → [Summarize again] → Root node

Multi-level retrieval from leaf to root
Best for: long documents, hierarchical knowledge
G. Corrective RAG (CRAG, 2024)
Query → Retrieve → [Relevance Evaluator]
                        ├── Relevant: use docs
                        ├── Ambiguous: refine + web search
                        └── Irrelevant: web search + filter
                               ↓
                          Generate Answer
H. Self-RAG (2023)
Query → LLM decides: [Retrieve? Yes/No]
If Yes → Retrieve → LLM critiques: [IsRel? IsSup? IsUse?]
                  → Generate with self-reflection tokens
                  → [ISREL] [ISSUP] [ISUSE] special tokens

Best for: adaptive retrieval without always retrieving

6.2 Hardware Requirements

For Development & Prototyping
Minimum (API-based RAG):
├── CPU: Any modern 4-core CPU
├── RAM: 16GB
├── Storage: 50GB SSD
├── GPU: Not required (using OpenAI/Anthropic APIs)
├── Network: Stable broadband
└── Cost: ~$50-200/month (API costs)

Recommended Dev Setup:
├── CPU: Apple M2/M3 or AMD Ryzen 9
├── RAM: 32-64GB (for local models)
├── Storage: 500GB NVMe SSD
├── GPU: RTX 3090 (24GB VRAM) - run 13B models
└── OS: Linux (Ubuntu 22.04) or macOS
For Running Local LLMs (Self-Hosted)
Small Models (7B params):
├── GPU: RTX 3080 (10GB) - quantized (Q4)
├── RAM: 32GB system RAM
├── VRAM needed: ~6GB for Q4, ~14GB for FP16 (rule-of-thumb sketch below)
└── Models: LLaMA 3.1 8B, Mistral 7B, Gemma 7B

Medium Models (13B-30B):
├── GPU: RTX 3090/4090 (24GB) for Q4
├── Multi-GPU: 2x RTX 3090 for FP16
├── VRAM: ~10GB (Q4), ~26GB (FP16)
└── Models: LLaMA 2 13B, Qwen 14B, Mistral 22B

Large Models (70B):
├── GPU: 4x A100 80GB or 2x H100 80GB
├── Or: 4x RTX 4090 (24GB each) with Q4 quantization
├── VRAM: ~40GB (Q4), ~140GB (FP16)
└── Models: LLaMA 3.1 70B, Qwen 72B

Frontier (405B+):
├── GPU: 8x H100 80GB (minimum)
├── VRAM: ~240GB (Q4), ~810GB (BF16)
└── Models: LLaMA 3.1 405B
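The VRAM figures above follow a simple rule of thumb: weights take params x bytes-per-weight, plus overhead for KV-cache and activations. A quick sketch (the 1.2 overhead factor is a rough assumption, not a measurement):

def estimate_vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights + ~20% for KV-cache/activations."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

print(estimate_vram_gb(8, 4))    # 8B model at Q4   -> ~4.8 GB
print(estimate_vram_gb(70, 16))  # 70B model at FP16 -> ~168 GB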
For Production RAG Service (Cloud)
Small Scale (<1000 QPS):
├── Vector DB: Qdrant on 32GB RAM, 8 cores
├── LLM: vLLM on 1x A100 40GB
├── API Server: 4 cores, 16GB RAM
└── Estimated cost: $1,500-3,000/month

Medium Scale (1000-10,000 QPS):
├── Vector DB: Pinecone or Qdrant cluster (3 nodes)
├── LLM: vLLM on 2-4x A100 80GB
├── API: Auto-scaling ECS/K8s
├── Cache: Redis cluster
└── Estimated cost: $5,000-15,000/month

Large Scale (>10,000 QPS):
├── Vector DB: Milvus cluster or Pinecone enterprise
├── LLM: TensorRT-LLM on H100 cluster
├── CDN + Global load balancing
└── Estimated cost: $30,000+/month
GPU Comparison for LLM Inference
GPU            VRAM            Bandwidth    FP16 TFLOPS   Best For
RTX 4090       24GB            1008 GB/s    82.6          Dev, 7B-13B
A100 40GB      40GB            1555 GB/s    312           Production 13B-70B
A100 80GB      80GB            2039 GB/s    312           Production 70B
H100 SXM       80GB            3350 GB/s    989           Frontier models
H200 SXM       141GB           4800 GB/s    989           Largest models
MI300X         192GB           5300 GB/s    1307          AMD alternative
Apple M3 Max   128GB unified   400 GB/s     ~14           Local dev, CPU+GPU

7. ADVANCED RAG PATTERNS

7.1 Agentic RAG with LangGraph

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

# Assumed defined elsewhere: `llm`, plus the retrieve_documents, web_search_tool,
# generate_answer, and check_hallucination node functions.

class RAGState(TypedDict):
    question: str
    documents: List[str]
    answer: str
    generation_count: int
    needs_web_search: bool

def grade_documents(state: RAGState) -> RAGState:
    """LLM grades each retrieved document for relevance"""
    docs = state['documents']
    question = state['question']
    
    relevant_docs = []
    for doc in docs:
        grade_prompt = f"""Is this document relevant to the question?
        Question: {question}
        Document: {doc[:500]}
        Answer with only: 'yes' or 'no'"""
        
        grade = llm.invoke(grade_prompt).content.strip().lower()
        if grade == 'yes':
            relevant_docs.append(doc)
    
    # If too few relevant docs, trigger web search
    state['documents'] = relevant_docs
    state['needs_web_search'] = len(relevant_docs) < 2
    return state

# Build the graph
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("grade_docs", grade_documents)
workflow.add_node("web_search", web_search_tool)
workflow.add_node("generate", generate_answer)
workflow.add_node("check_hallucination", check_hallucination)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_docs")
workflow.add_conditional_edges(
    "grade_docs",
    lambda state: "web_search" if state['needs_web_search'] else "generate"
)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", "check_hallucination")
workflow.add_conditional_edges(
    "check_hallucination",
    # generate_answer is expected to increment state['generation_count'] each pass
    lambda state: "generate" if state['generation_count'] < 3 else END
)

app = workflow.compile()

7.2 Conversational RAG with Memory

from collections import deque

class ConversationalRAG:
    """Assumes a module-level `llm` chat client for the query-rewriting step below."""
    def __init__(self, retriever, generator, max_history=5):
        self.retriever = retriever
        self.generator = generator
        self.conversation_history = deque(maxlen=max_history * 2)
    
    def _build_contextualized_query(self, question: str) -> str:
        """Rewrite query using conversation history"""
        if not self.conversation_history:
            return question
        
        history_str = "\n".join([
            f"{'User' if i%2==0 else 'Assistant'}: {msg}"
            for i, msg in enumerate(self.conversation_history)
        ])
        
        prompt = f"""Given this conversation history:
        {history_str}
        
        Rewrite the follow-up question as a standalone question:
        Follow-up: {question}
        Standalone question:"""
        
        return llm.invoke(prompt).content.strip()
    
    def chat(self, user_message: str) -> str:
        # Contextualize query
        standalone_q = self._build_contextualized_query(user_message)
        
        # Retrieve
        chunks = self.retriever.search(standalone_q, top_k=4)
        
        # Generate with history context
        answer = self.generator.generate_with_history(
            question=user_message,
            chunks=chunks,
            history=list(self.conversation_history)
        )
        
        # Update history
        self.conversation_history.append(user_message)
        self.conversation_history.append(answer)
        
        return answer

7.3 Multi-Modal RAG

# Handle images, tables, charts alongside text
# Sketch: assumes unstructured's partition_pdf, CLIP via transformers, and
# index/query helpers (index_text, index_image, text_index, image_index,
# is_visual_query, build_multimodal_prompt, encode_image) implemented elsewhere.

from sentence_transformers import SentenceTransformer
from transformers import CLIPModel
from openai import OpenAI
from unstructured.partition.pdf import partition_pdf

class MultiModalRAG:
    def __init__(self):
        self.text_embedder = SentenceTransformer('BAAI/bge-large-en')
        self.image_embedder = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
        self.vision_llm = OpenAI()  # multimodal chat endpoint (gpt-4o below)
    
    def ingest_pdf_with_images(self, pdf_path: str):
        """Extract text, tables, and images from PDF"""
        # Use LlamaParse or Unstructured for extraction
        elements = partition_pdf(
            filename=pdf_path,
            strategy='hi_res',
            extract_images_in_pdf=True,
            infer_table_structure=True
        )
        
        for element in elements:
            if element.category == 'Table':  # unstructured exposes element type via .category
                # Convert to markdown, embed as text
                table_text = element.metadata.text_as_html
                self.index_text(table_text, {'type': 'table'})
            
            elif element.category == 'Image':
                # Embed with CLIP, store base64
                image_embedding = self.encode_image(element.metadata.image_path)
                self.index_image(image_embedding, element.metadata)
            
            else:
                self.index_text(element.text, {'type': 'text'})
    
    def query(self, question: str, image=None) -> str:
        """Query with optional image input"""
        # Text retrieval
        text_chunks = self.text_index.search(question, top_k=3)

        # Image retrieval (if query relates to visual content)
        image_chunks = []  # default to no image context
        if self.is_visual_query(question):
            image_chunks = self.image_index.search(question, top_k=2)
        
        # Compose multi-modal context for GPT-4V
        messages = self.build_multimodal_prompt(
            question, text_chunks, image_chunks
        )
        
        return self.vision_llm.chat.completions.create(
            model='gpt-4o', messages=messages
        ).choices[0].message.content

7.4 Knowledge Graph RAG

# Neo4j + LLM for structured knowledge retrieval
# Sketch: extract_and_store_relations, format_graph_results, format_vector_results,
# and the vector_retriever/generator used below are assumed implemented elsewhere.

from neo4j import GraphDatabase
import spacy

class KnowledgeGraphRAG:
    def __init__(self, neo4j_uri, neo4j_auth):
        self.driver = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
        self.nlp = spacy.load('en_core_web_lg')
        self.llm = OpenAI()
    
    def ingest_to_graph(self, text: str):
        """Extract entities and relationships, store in Neo4j"""
        doc = self.nlp(text)
        
        with self.driver.session() as session:
            # Create entities
            for ent in doc.ents:
                session.run(
                    "MERGE (e:Entity {name: $name, type: $type})",
                    name=ent.text, type=ent.label_
                )
            
            # Create relationships (simplified β€” use LLM for better extraction)
            for sent in doc.sents:
                self.extract_and_store_relations(session, sent.text)
    
    def cypher_query_from_nl(self, question: str) -> str:
        """Convert natural language to Cypher using LLM"""
        prompt = f"""Convert this question to a Neo4j Cypher query.
        Graph has: (Entity {{name, type}}) and [:RELATES_TO {{relation}}] edges.
        
        Question: {question}
        Cypher query:"""
        
        return self.llm.chat.completions.create(
            model='gpt-4o',
            messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content.strip()
    
    def query(self, question: str) -> str:
        # Try a structured graph query first (validate LLM-generated Cypher before running it)
        cypher = self.cypher_query_from_nl(question)
        
        with self.driver.session() as session:
            graph_results = session.run(cypher).data()
        
        # Also do vector retrieval
        vector_results = self.vector_retriever.search(question)
        
        # Combine both
        combined_context = self.format_graph_results(graph_results) + \
                          self.format_vector_results(vector_results)
        
        return self.generator.generate(question, combined_context)

8. BUILDING YOUR OWN RAG SERVICE

8.1 System Design β€” Complete Architecture

┌──────────────────────────────────────────────────────┐
│                  CLIENT LAYER                        │
│  Web App  |  Mobile  |  Slack Bot  |  API Clients    │
└─────────────────────┬────────────────────────────────┘
                      │ HTTPS
┌─────────────────────▼────────────────────────────────┐
│                  API GATEWAY                         │
│  (Kong / AWS API Gateway / Nginx)                    │
│  Rate limiting, Auth (JWT/OAuth), Routing            │
└──────┬──────────────┬──────────────┬─────────────────┘
       │              │              │
┌──────▼──────┐ ┌─────▼──────┐ ┌────▼──────────┐
│  Query API  │ │ Ingest API │ │  Admin API    │
│  (FastAPI)  │ │ (FastAPI)  │ │  (FastAPI)    │
└──────┬──────┘ └─────┬──────┘ └───────────────┘
       │              │
┌──────▼──────────────▼────────────────────────────────┐
│                RAG ORCHESTRATION LAYER               │
│  Query Preprocessing → Retrieval → Reranking →       │
│  Context Assembly → Generation → Post-processing     │
└──────┬─────────────────────────┬─────────────────────┘
       │                         │
┌──────▼──────────┐    ┌─────────▼─────────────────────┐
│  RETRIEVAL      │    │  GENERATION                   │
│  ─────────────  │    │  ──────────────────────────   │
│  Qdrant/Milvus  │    │  vLLM (LLaMA/Mistral)         │
│  (Vector Store) │    │  or OpenAI/Anthropic API      │
│                 │    │                               │
│  Elasticsearch  │    │  Prompt Templates             │
│  (BM25 Search)  │    │  Context Compression          │
│                 │    │  Response Streaming           │
│  Redis          │    └───────────────────────────────┘
│ (Semantic Cache)│
└─────────────────┘
┌──────────────────────────────────────────────────────┐
│                  DATA LAYER                          │
│  PostgreSQL (metadata) | S3/GCS (raw docs)           │
│  Redis (sessions/cache) | Neo4j (knowledge graph)    │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│               OBSERVABILITY LAYER                    │
│  Prometheus + Grafana | Langfuse | Sentry            │
│  OpenTelemetry | ELK Stack (logs)                    │
└──────────────────────────────────────────────────────┘

8.2 Docker Compose for Full Stack

# docker-compose.yml
version: '3.8'
services:
  rag-api:
    build: ./api
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
      - POSTGRES_URL=postgresql://user:pass@postgres:5432/ragdb
    depends_on: [qdrant, redis, postgres]
  
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: ["./qdrant_data:/qdrant/storage"]
  
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
  
  postgres:
    image: pgvector/pgvector:pg16
    ports: ["5432:5432"]
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass  # must match POSTGRES_URL in rag-api above
    volumes: ["./postgres_data:/var/lib/postgresql/data"]
  
  nginx:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
    volumes: ["./nginx.conf:/etc/nginx/nginx.conf"]
    depends_on: [rag-api]
  
  langfuse:
    image: langfuse/langfuse:latest
    ports: ["3000:3000"]
    environment:
      # NOTE: create the `langfuse` database inside the postgres container first
      - DATABASE_URL=postgresql://user:pass@postgres:5432/langfuse

8.3 Semantic Caching

import hashlib
import json

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Cache responses for semantically similar queries"""
    
    def __init__(self, threshold=0.95, ttl=3600):
        self.redis = redis.Redis(host='localhost', port=6379, decode_responses=False)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold
        self.ttl = ttl
    
    def _get_cache_keys(self):
        return [k.decode() for k in self.redis.keys("cache:*")]
    
    def get(self, query: str) -> dict | None:
        query_emb = self.embedder.encode([query])[0]
        cache_keys = self._get_cache_keys()
        
        for key in cache_keys:
            cached = self.redis.get(key)
            if not cached:
                continue
            
            data = json.loads(cached)
            cached_emb = np.array(data['embedding'])
            
            # Cosine similarity
            sim = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
            )
            
            if sim >= self.threshold:
                return data['response']  # Cache hit!
        
        return None  # Cache miss
    
    def set(self, query: str, response: dict):
        embedding = self.embedder.encode([query])[0].tolist()
        key = f"cache:{hash(query)}"
        
        self.redis.setex(
            key,
            self.ttl,
            json.dumps({'embedding': embedding, 'response': response})
        )

9. CUTTING-EDGE DEVELOPMENTS (2024-2025)

9.1 Long-Context vs RAG Debate

  • Gemini 1.5 Pro: 1M+ token context window
  • Claude 3.5: 200K context
  • The reality: RAG still wins for large corpora (billions of tokens), cost efficiency, and dynamic updates
  • Hybrid approach: RAG for retrieval, long context for reasoning over retrieved docs
  • Lost-in-the-middle problem: LLMs struggle with middle-of-context info; RAG helps by limiting context

9.2 Late Chunking (2024)

  • Embed full documents, then chunk embeddings (not text)
  • Preserves full document context in each chunk embedding
  • Jina AI approach: `jina-embeddings-v3`
  • Better than traditional chunk-first, embed-second pipelines (see the sketch below)
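A rough sketch of the idea with HuggingFace Transformers: encode the full document once so every token embedding sees the whole text, then mean-pool token vectors per chunk span. The model choice and token spans below are illustrative; a real implementation (e.g., Jina's) uses long-context models and derives spans from the tokenizer's offset mapping:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk(document: str, token_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token vectors per chunk span."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Every chunk vector has "seen" the full document through self-attention
    return [token_embs[start:end].mean(dim=0) for start, end in token_spans]

# Token spans are hand-picked here purely for illustration
vectors = late_chunk("RAG retrieves documents. HNSW indexes vectors.", [(1, 6), (6, 11)])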

9.3 Contextual Retrieval (Anthropic, 2024)

  • Prepend context to each chunk before embedding
  • Prompt: "Here is a document: {DOCUMENT}. Please give a short context for this chunk: {CHUNK_CONTENT}"
  • Reduces retrieval failures by 49%
  • Combined with BM25: 67% reduction in failures
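A minimal sketch of the technique with the OpenAI client used elsewhere in this roadmap (the prompt paraphrases Anthropic's published example; wording and model are illustrative):

from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend a short LLM-written context to a chunk before embedding it."""
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Here is a chunk from it:\n{chunk}\n\n"
        "Write a short context that situates this chunk within the overall "
        "document, to improve search retrieval. Answer with only the context."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"  # embed this combined text instead of the raw chunk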

9.4 Speculative RAG (2024)

  • Smaller model generates draft answer + reasoning
  • Larger model verifies and refines
  • 2-4x faster than single large model RAG

9.5 RAG Fusion & Adaptive RAG

  • Multiple retrieval strategies fused with learned weights
  • Adaptive: LLM decides retrieval strategy per query
  • FLARE: retrieve only when generation uncertainty is high
  • Self-RAG: generate, critique, and regenerate

9.6 Multimodal RAG (2024-2025)

  • ColPali: PDF retrieval using vision encoder (no text extraction needed!)
    • Embed PDF pages as images using PaliGemma
    • Retrieve relevant pages, feed to multimodal LLM
  • Video RAG: temporal grounding in video content
  • Audio RAG: Whisper transcription + speaker diarization

9.7 Structured Output & Tool-Augmented RAG

  • LLM generates SQL/Cypher to query databases
  • NL2SQL: Text-to-SQL for structured data
  • Tool-augmented RAG: retrieval + calculation + code execution
  • Instructor library for guaranteed JSON output
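A minimal Instructor sketch: the library patches the OpenAI client so responses are parsed and validated against a Pydantic schema (the `GroundedAnswer`/`Citation` models here are made up for illustration):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Citation(BaseModel):
    source: str
    quote: str

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[Citation]

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=GroundedAnswer,  # Instructor retries until this schema validates
    messages=[{"role": "user", "content": "Answer from context: ... (RAG prompt here)"}],
)
print(result.answer, result.citations)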

9.8 Embedding Innovations (2025)

  • Matryoshka embeddings (MRL): single model, multiple dimensions
  • Binary quantization with rescoring: 40x faster, 0.3% accuracy loss
  • Int8 quantization: 2x faster, negligible accuracy loss
  • Multi-vector embeddings: multiple vectors per document (ColBERT-style)
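A NumPy sketch of binary quantization with full-precision rescoring: sign-binarize the vectors, shortlist by Hamming distance, then rescore the shortlist with float cosine (sizes are toy-scale, not the 40x figure above):

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)

# 1-bit quantization: keep only the sign of each dimension, packed to bits
doc_bits = np.packbits(docs > 0, axis=1)   # 384 floats -> 48 bytes per doc
query_bits = np.packbits(query > 0)

# Hamming distance via XOR + popcount (cheap first-stage filter)
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)
shortlist = np.argsort(hamming)[:100]      # top-100 binary candidates

# Rescore the shortlist with full-precision cosine similarity
d = docs[shortlist]
cosine = d @ query / (np.linalg.norm(d, axis=1) * np.linalg.norm(query))
top10 = shortlist[np.argsort(cosine)[::-1][:10]]
print(top10)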

9.9 Open-Source RAG Stacks (2025)

  • R2R (SciPhi): Production RAG framework with built-in analytics
  • Verba (Weaviate): "The Golden RAGtriever", complete open-source RAG app
  • RAGFlow: Deep document understanding RAG
  • Cognita (Truefoundry): Modular RAG framework
  • Kotaemon: Document QA with citations
  • AnythingLLM: All-in-one self-hosted RAG desktop

9.10 RAG + Agents (2025 Trend)

  • OpenAI Deep Research: Multi-step web RAG with reasoning
  • Perplexity Sonar: Real-time RAG with citations
  • You.com Research: Agent-based RAG pipeline
  • Trend: RAG evolving into full agentic research systems

10. PROJECT IDEAS: BEGINNER TO ADVANCED

🟢 BEGINNER LEVEL (Week 1-4)

Project 1: Personal Document Chatbot

  • Goal: Chat with your own PDFs/notes
  • Tech: LangChain + OpenAI + ChromaDB + Streamlit
  • Steps:
    • Upload PDF via Streamlit UI
    • Parse with PyPDF2
    • Chunk with RecursiveCharacterTextSplitter
    • Embed with OpenAI embeddings
    • Store in ChromaDB (local)
    • Query with conversational chain
  • Skills Learned: Basic RAG pipeline, UI creation
  • Time: 2-3 days

Project 2: FAQ Bot for a Website

  • Goal: Answer questions from a website's content
  • Tech: Scrapy + Sentence-Transformers + FAISS + FastAPI
  • Steps:
    • Scrape website content
    • Clean and chunk HTML
    • Embed with MiniLM
    • Build FAISS index
    • Create FastAPI endpoint
    • Return top-3 answers with sources
  • Skills Learned: Web scraping, REST API, FAISS
  • Time: 3-5 days

Project 3: Local AI Assistant (Fully Offline)

  • Goal: RAG system with no API costs
  • Tech: Ollama (LLaMA 3.1 8B) + ChromaDB + nomic-embed-text
  • Steps:
    • Install Ollama, pull LLaMA 3.1 8B
    • Use Ollama for embeddings (nomic-embed-text)
    • Build local ChromaDB index
    • Chat interface with Gradio
  • Skills Learned: Local LLMs, privacy-first RAG
  • Time: 1-2 days

🟡 INTERMEDIATE LEVEL (Week 5-12)

Project 4: Advanced Legal/Medical Document RAG

  • Goal: High-accuracy domain-specific RAG with citations
  • Tech: LlamaIndex + Qdrant + BGE Reranker + GPT-4
  • Features:
    • Semantic chunking for legal documents
    • Hybrid BM25 + dense retrieval
    • Cross-encoder reranking
    • Page-level citations
    • Confidence scores
    • "I don't know" detection
  • Skills Learned: Domain RAG, hybrid retrieval, citations
  • Time: 1-2 weeks
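
A reranking sketch with a cross-encoder from sentence-transformers; the query and candidate passages are illustrative stand-ins for hybrid-retrieval output:

# Rerank candidate passages with a cross-encoder
from sentence_transformers import CrossEncoder

query = "What damages can be claimed for breach of contract?"
candidates = [
    "Damages for breach are limited to direct losses under clause 12.",
    "The term of the lease is five years.",
]

reranker = CrossEncoder("BAAI/bge-reranker-base")
scores = reranker.predict([(query, passage) for passage in candidates])
for score, passage in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {passage}")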

Project 5: Multi-Document Research Assistant

  • Goal: Compare and synthesize across 100+ documents
  • Tech: LangGraph + HyDE + Multi-Query + Cohere Rerank (multi-query sketch after this project)
  • Features:
    • Upload multiple documents
    • Cross-document synthesis
    • Contradiction detection
    • Source attribution matrix
    • Export to report
  • Skills Learned: Agentic RAG, complex synthesis
  • Time: 2 weeks
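
A multi-query retrieval sketch under stated assumptions: `ask_llm` is a hypothetical LLM callable, and `retriever` returns dicts carrying an "id" key; the point is fan-out plus deduplication:

# Generate query paraphrases, retrieve for each, merge without duplicates
def multi_query_retrieve(question: str, ask_llm, retriever, n: int = 3) -> list[dict]:
    prompt = f"Write {n} diverse rephrasings of: {question}\nOne per line, no numbering."
    variants = [question] + ask_llm(prompt).splitlines()[:n]
    seen, merged = set(), []
    for variant in variants:
        for doc in retriever(variant):
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged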

Project 6: Conversational RAG with Memory

  • Goal: Chat that remembers past conversations
  • Tech: LangChain + PostgreSQL + pgvector + Redis
  • Features:
    • User sessions and history
    • Query contextualization (sketch after this list)
    • Long-term memory storage in PostgreSQL
    • Short-term session cache in Redis
    • Personal knowledge base per user
  • Skills Learned: Conversational AI, session management
  • Time: 2 weeks
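
A query-contextualization sketch: rewrite a follow-up question into a standalone query before retrieval. `ask_llm` is again a hypothetical client wrapper:

# Condense conversation history + follow-up into a self-contained question
def contextualize(history: list[tuple[str, str]], follow_up: str, ask_llm) -> str:
    transcript = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
    prompt = (
        "Rewrite the final user question so it is fully self-contained.\n"
        f"Conversation:\n{transcript}\nUser: {follow_up}\n"
        "Standalone question:"
    )
    return ask_llm(prompt).strip()

# history=[("Who wrote Dune?", "Frank Herbert.")], follow_up="When did he die?"
# -> "When did Frank Herbert die?"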

Project 7: Code Documentation RAG

  • Goal: Chat with a large codebase
  • Tech: Tree-sitter + BGE + Qdrant + Claude
  • Features:
    • Parse code into semantic chunks (function/class level; chunking sketch after this list)
    • Include docstrings and comments
    • Dependency graph extraction
    • "How does X work?" β†’ returns relevant code + explanation
  • Skills Learned: Code understanding, AST parsing
  • Time: 1-2 weeks
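
A function-level chunking sketch using Python's stdlib ast module as a single-language stand-in for Tree-sitter (which generalizes the same idea across languages):

# Chunk Python source at function/class granularity, keeping docstrings
import ast

def chunk_python_source(source: str) -> list[dict]:
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "docstring": ast.get_docstring(node),
                "code": ast.get_source_segment(source, node),  # Python 3.8+
            })
    return chunks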

🔴 ADVANCED LEVEL (Week 13-24)

Project 8: Production RAG SaaS

  • Goal: Multi-tenant RAG service with billing
  • Tech: FastAPI + Qdrant + vLLM + Stripe + Auth0 + K8s
  • Features:
    • Multi-tenant isolation (namespace per user; filtering sketch after this list)
    • Rate limiting and quota management
    • Usage-based billing with Stripe
    • Admin dashboard
    • Webhook for document events
    • SLA monitoring
    • Auto-scaling based on load
  • Skills Learned: SaaS architecture, multitenancy, DevOps
  • Time: 4-6 weeks
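
A tenant-isolation sketch with Qdrant payload filtering: one shared collection, every point tagged with its tenant id. Connection details, collection name, and the dummy query vector are illustrative (newer clients also offer query_points):

# Restrict search to a single tenant via a payload filter
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

hits = client.search(
    collection_name="documents",
    query_vector=[0.1] * 768,  # stand-in for a real query embedding
    query_filter=models.Filter(must=[
        models.FieldCondition(key="tenant_id",
                              match=models.MatchValue(value="acme-corp")),
    ]),
    limit=5,
)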

Project 9: Real-Time RAG with Web Search

  • Goal: Answer questions with live web data
  • Tech: Tavily/SerpAPI + LangGraph + GPT-4 + Streaming
  • Features:
    • Combine internal docs with web search
    • CRAG pattern (evaluate, search web if needed; control-flow sketch after this list)
    • Streaming responses (SSE)
    • Source freshness scoring
    • Fact verification step
  • Skills Learned: Agentic RAG, streaming, web augmentation
  • Time: 3 weeks
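
A CRAG-style control-flow sketch; `grade`, `web_search`, and `generate` are hypothetical stand-ins for the retrieval evaluator, search tool, and LLM call:

# Corrective RAG: grade the retrieval, fall back to web search when weak
def corrective_rag(question: str, retriever, grade, web_search, generate) -> str:
    chunks = retriever(question)
    verdict = grade(question, chunks)  # "correct" | "ambiguous" | "incorrect"
    if verdict == "incorrect":
        chunks = web_search(question)           # discard retrieval, go to the web
    elif verdict == "ambiguous":
        chunks = chunks + web_search(question)  # augment with fresh web results
    return generate(question, chunks)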

Project 10: GraphRAG Knowledge System

  • Goal: Interconnected knowledge with graph traversal
  • Tech: Neo4j + SpaCy + LangChain + GPT-4
  • Features:
    • Entity + relationship extraction
    • Community detection for summarization
    • Graph-vector hybrid retrieval (driver sketch after this list)
    • Relationship-aware answers
    • Knowledge graph visualization
  • Skills Learned: Knowledge graphs, NLP, graph databases
  • Time: 4-6 weeks
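
A graph-retrieval sketch with the official neo4j driver; the connection details and the two-node graph are illustrative:

# Store one relationship, then pull relationship-aware facts for the prompt
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        "MERGE (a:Company {name: $a}) MERGE (b:Company {name: $b}) "
        "MERGE (a)-[:ACQUIRED {year: 2023}]->(b)",
        a="AcmeCorp", b="WidgetCo",
    )
    rows = session.run(
        "MATCH (a:Company)-[r:ACQUIRED]->(b:Company) "
        "RETURN a.name AS buyer, b.name AS target, r.year AS year"
    )
    facts = [f"{r['buyer']} acquired {r['target']} in {r['year']}" for r in rows]

driver.close()
# `facts` would join the vector-retrieved chunks in the generation prompt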

Project 11: Fine-Tuned Embedding + RAG Pipeline

  • Goal: Custom embedding model for your domain
  • Tech: Sentence-Transformers + MTEB + Qdrant
  • Steps:
    • Collect domain Q&A pairs (1000+ examples)
    • Fine-tune MiniLM with MultipleNegativesRankingLoss (MNRL; training sketch after these steps)
    • Evaluate on BEIR
    • Deploy fine-tuned model
    • Compare vs generic embeddings
  • Skills Learned: Embedding fine-tuning, MTEB evaluation
  • Time: 3 weeks
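
A fine-tuning sketch using the classic sentence-transformers fit API (v3 added a Trainer-based API as well); the two query/passage pairs stand in for the 1000+ examples you would actually collect:

# Fine-tune MiniLM with in-batch negatives (MNRL)
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

pairs = [
    InputExample(texts=["What is the notice period?",
                        "Either party may terminate with 30 days notice."]),
    InputExample(texts=["How are disputes resolved?",
                        "Disputes go to binding arbitration in Delaware."]),
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch pairs act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("minilm-domain-ft")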

Project 12: MultiModal RAG with ColPali

  • Goal: Search PDFs using visual understanding (no OCR!)
  • Tech: ColPali + PaliGemma + GPT-4V + Qdrant
  • Features:
    • Index PDF pages as images
    • Visual similarity search
    • Answer questions about charts/tables/diagrams
    • No text extraction required
  • Skills Learned: Vision models, multimodal search
  • Time: 3-4 weeks

11. REVERSE ENGINEERING EXISTING SYSTEMS

11.1 How to Reverse Engineer RAG Products

Step 1: Black-Box Testing
- Send queries and observe:
  - Response latency (hints at retrieval time)
  - Citation format (hints at chunk size)
  - "I don't know" behavior
  - Max context length behavior
  - Streaming vs batch responses
  - Error messages (may reveal the underlying stack)

Step 2: Analyze Behavior Patterns
Perplexity.ai analysis:
  - Always cites web sources → live web search
  - Fast responses → parallel retrieval + a small reranker
  - Shows source snippets → 200-500 token chunks
  - Sometimes displays "searching for..." → agentic step made visible
  - Exact quote matching → BM25 + dense hybrid

ChatGPT with file upload:
  - Quoted excerpts suggest 512-1024 token chunks
  - Summarizes long docs → retrieval + summarization
  - Loses information in large files → fixed context budget

Step 3: Reconstruct Architecture
# Reconstruct a Perplexity-like system (sketch):
import asyncio

class PerplexityClone:
    async def query(self, question: str) -> str:
        # 1. Classify query intent (factual, conversational, code, ...)
        intent = self.classify(question)

        # 2. Generate several search queries from the question
        queries = self.generate_search_queries(question, n=3)

        # 3. Run the web searches concurrently (web_search must be async)
        results = await asyncio.gather(*[
            self.web_search(q) for q in queries
        ])

        # 4. Parse and chunk the raw results
        chunks = self.parse_search_results(results)

        # 5. Rerank chunks against the original question
        ranked = self.reranker.rerank(question, chunks)

        # 6. Generate an answer with inline citations
        return self.generate_with_citations(question, ranked[:5])

11.2 Reverse Engineering Specific Systems

Notion AI
Observations:
- Context-aware (knows current page)
- Generates in Notion format (markdown blocks)
- Personal workspace knowledge

Likely Architecture:
- Workspace indexed per user in tenant-isolated vector store
- Block-level chunking (Notion's atomic units)
- Metadata filtering by workspace/page/user
- Fine-tuned generation for Notion markdown output

GitHub Copilot
Observations:
- Uses surrounding code context
- Repository-wide understanding
- Language-specific knowledge

Likely Architecture:
- File-level and function-level chunking by AST
- BM25 on identifiers + dense on semantics
- Sliding window context of open files
- Fill-in-the-middle (FIM) trained model
- Repository-level RAG for cross-file context

12. PRODUCTION DEPLOYMENT & MLOPS

12.1 RAG Pipeline Testing

# Unit testing RAG components
import pytest
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

class TestRAGPipeline:
    # `retriever` and `generator` are assumed to be initialized elsewhere
    # (e.g. pytest fixtures), as are the test-set lists used below.
    def test_retrieval_recall(self):
        """Ensure retrieval finds known-relevant docs"""
        test_cases = [
            {
                "query": "What is the refund policy?",
                "expected_doc_id": "policy_doc_001"
            }
        ]
        for case in test_cases:
            results = retriever.search(case['query'], top_k=5)
            ids = [r['id'] for r in results]
            assert case['expected_doc_id'] in ids
    
    def test_no_hallucination(self):
        """Answers must be grounded in context"""
        question = "What is the capital of France?"
        context = ["France is a country in Western Europe."]  # No capital mentioned
        answer = generator.generate(question, context)
        
        # Should say "not in context" not "Paris"
        assert "not" in answer.lower() or "don't" in answer.lower()
    
    def test_ragas_metrics(self):
        """Run RAGAS evaluation on test set"""
        data = Dataset.from_dict({
            "question": test_questions,
            "answer": generated_answers,
            "contexts": retrieved_contexts,
            "ground_truth": ground_truth_answers
        })
        
        results = evaluate(data, metrics=[
            faithfulness, answer_relevancy,
            context_recall, context_precision
        ])
        
        assert results['faithfulness'] > 0.85
        assert results['context_precision'] > 0.75

12.2 CI/CD Pipeline for RAG

# .github/workflows/rag-pipeline.yml
name: RAG Pipeline CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit
      - name: Run integration tests
        run: pytest tests/integration
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Evaluate RAG quality
        run: python scripts/evaluate_rag.py --threshold 0.8
      - name: Build Docker image
        run: docker build -t rag-service:${{ github.sha }} .
      - name: Deploy to staging
        if: github.ref == 'refs/heads/main'
        run: ./scripts/deploy.sh staging

12.3 Monitoring & Alerting

# Prometheus metrics for RAG service
from prometheus_client import Counter, Histogram, Gauge

rag_requests_total = Counter(
    'rag_requests_total',
    'Total RAG requests',
    ['status', 'route']
)

rag_latency_seconds = Histogram(
    'rag_latency_seconds',
    'RAG request latency',
    ['stage'],  # retrieval, reranking, generation, total
    buckets=[0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0]
)

retrieved_chunks_gauge = Gauge(
    'retrieved_chunks_count',
    'Number of chunks retrieved per request'
)

hallucinations_detected = Counter(
    'rag_hallucinations_detected_total',  # Counter convention: _total suffix
    'Responses flagged as hallucinations'
)

# Alert rules (Grafana/AlertManager)
ALERT_RULES = {
    "high_latency": "p99 > 5s for 5min",
    "low_faithfulness": "faithfulness < 0.7 for 10min",
    "high_error_rate": "errors > 5% for 2min",
    "vector_db_down": "qdrant_health = 0 for 1min"
}

13. RESEARCH PAPERS & RESOURCES

13.1 Foundational Papers

  1. Lewis et al. (2020) – "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
    • The original RAG paper (Facebook AI)
  2. Karpukhin et al. (2020) – "Dense Passage Retrieval for Open-Domain Question Answering" (DPR)
  3. Izacard & Grave (2021) – "Leveraging Passage Retrieval with Generative Models for Open Domain QA" (FiD)
  4. Khattab & Zaharia (2020) – "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction"
  5. Robertson & Zaragoza (2009) – "The Probabilistic Relevance Framework: BM25 and Beyond"
  6. Malkov & Yashunin (2016) – "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (HNSW)
  7. Gao et al. (2022) – "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE)

13.2 Advanced Papers (2023-2025)

  1. Asai et al. (2023) – "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"
  2. Shi et al. (2023) – "REPLUG: Retrieval-Augmented Black-Box Language Models"
  3. Edge et al. (2024) – "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (Microsoft)
  4. Sarthi et al. (2024) – "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"
  5. Yan et al. (2024) – "Corrective Retrieval Augmented Generation" (CRAG)
  6. Faysse et al. (2024) – "ColPali: Efficient Document Retrieval with Vision Language Models"
  7. Anthropic (2024) – "Introducing Contextual Retrieval" (engineering blog)
  8. Zhao et al. (2024) – "Retrieval-Augmented Generation for AI-Generated Content: A Survey"

13.3 Learning Resources

Courses:
  • DeepLearning.AI: "Building and Evaluating Advanced RAG" (free)
  • DeepLearning.AI: "LangChain for LLM Application Development"
  • DeepLearning.AI: "Vector Databases: from Embeddings to Applications"
  • fast.ai: "Practical Deep Learning for Coders"
  • Hugging Face: NLP Course (free, comprehensive)

Books:
  • "Hands-On Large Language Models" – Jay Alammar & Maarten Grootendorst (2024)
  • "Building LLM Powered Applications" – Valentina Alto (2024)
  • "The NLP Practitioner's Handbook" (multiple authors)
  • "Designing Machine Learning Systems" – Chip Huyen

YouTube Channels:
  • Andrej Karpathy – deep model understanding
  • AI Explained – RAG and LLM news
  • Sam Witteveen – practical LLM tutorials
  • James Briggs – RAG tutorials

Communities:
  • r/LocalLLaMA (Reddit) – self-hosted focus
  • Hugging Face Discord – model discussions
  • LangChain Discord – framework help
  • LlamaIndex Discord – RAG-specific help

Benchmarks:
  • MTEB – embedding model benchmark
  • BEIR – zero-shot information retrieval benchmark
  • RAGAS – RAG evaluation framework
  • LLM-as-Judge benchmarks
  • LMSYS Chatbot Arena

Quick-Start Checklist

Week 1-2: Get Your First RAG Working
  • [ ] Install: `pip install langchain openai chromadb sentence-transformers`
  • [ ] Get OpenAI API key
  • [ ] Run basic RAG on 3 PDF files
  • [ ] Understand the four pipeline stages: chunk → embed → retrieve → generate

Week 3-4: Level Up Retrieval
  • [ ] Implement BM25 with `rank_bm25`
  • [ ] Try `BAAI/bge-large-en-v1.5` embedding model
  • [ ] Set up Qdrant locally with Docker
  • [ ] Add cross-encoder reranking

Month 2: Advanced Patterns
  • [ ] Implement HyDE
  • [ ] Add multi-query retrieval
  • [ ] Build conversational RAG with history
  • [ ] Evaluate with RAGAS

Month 3: Production Ready
  • [ ] FastAPI service with proper error handling
  • [ ] Semantic cache with Redis
  • [ ] Langfuse tracing
  • [ ] Docker Compose deployment
  • [ ] CI/CD pipeline

Month 4-6: Own the Stack
  • [ ] Fine-tune embeddings on domain data
  • [ ] Self-host LLM with vLLM
  • [ ] Build agentic RAG with LangGraph
  • [ ] Deploy to Kubernetes
  • [ ] Monitor with Prometheus + Grafana