🧠 Complete LLM Development Roadmap

Building Your Own Large Language Model & AI Service (Like Claude, Gemini, ChatGPT)

Scope: This roadmap covers everything from foundational math to deploying a production-grade LLM service: structured learning paths, algorithms, architecture, hardware, reverse engineering, and cutting-edge developments.
Last Updated: 2025  |  Covers models through Llama 3.1, DeepSeek-V3, DeepSeek-R1, Gemini 1.5, GPT-4o, Claude 3.5  |  Total Estimated Effort: 1500–3000 hours of focused study and implementation

1. Foundation Prerequisites

1.1 Programming Languages

  • Python (Primary Language)
    • OOP, functional programming, decorators, generators
    • Async/await, multiprocessing, threading
    • Memory management, profiling, optimization
    • Type hints, dataclasses, abstract classes
  • C/C++ (Performance-critical components)
    • Pointers, memory allocation, RAII
    • CUDA extensions, custom kernels
  • CUDA (GPU Programming)
    • Thread blocks, warps, shared memory
    • Memory coalescing, kernel optimization
  • Bash/Shell (DevOps, automation)
  • SQL (Data management)
  • Rust (Optional - emerging for inference engines)

1.2 Computer Science Fundamentals

  • Data Structures: Arrays, Trees, Graphs, Hash Tables, Heaps
  • Algorithms: Sorting, Searching, Dynamic Programming, Graph algorithms
  • Complexity Analysis: Big-O notation, space/time tradeoffs
  • Distributed Systems: CAP theorem, consensus algorithms, sharding
  • Operating Systems: Process management, memory paging, I/O
  • Computer Networks: TCP/IP, HTTP/2, gRPC, WebSockets
  • Databases: Relational (PostgreSQL), NoSQL (MongoDB, Redis), Vector DBs

1.3 Software Engineering Practices

  • Version Control: Git, GitHub, branching strategies
  • Testing: Unit, integration, regression, load testing
  • CI/CD: GitHub Actions, Jenkins, Docker, Kubernetes
  • Design Patterns: Factory, Observer, Strategy, Pipeline
  • API Design: REST, GraphQL, gRPC
  • Containerization: Docker Compose, Kubernetes orchestration

2. Mathematics & Statistics Deep Dive

2.1 Linear Algebra (Most Critical)

  • Vectors & Spaces
    • Vector operations, dot products, cross products
    • Vector spaces, basis, span, linear independence
    • Subspaces, null space, column space
  • Matrices
    • Matrix multiplication, transpose, inverse
    • Rank, determinant, trace
    • Special matrices: diagonal, orthogonal, symmetric, positive definite
  • Eigendecomposition
    • Eigenvalues, eigenvectors, characteristic polynomial
    • Diagonalization, spectral theorem
    • Power iteration, QR algorithm
  • Singular Value Decomposition (SVD)
    • Full vs. truncated SVD
    • Applications in dimensionality reduction, LoRA
    • Relationship to PCA
  • Tensor Operations
    • Higher-order tensors, tensor contractions
    • Einstein summation notation (einsum)
    • Tensor decomposition (Tucker, CP)
  • Norms & Distances
    • L1, L2, Frobenius, nuclear norms
    • Cosine similarity, KL divergence as distance

2.2 Calculus & Optimization

  • Differential Calculus
    • Derivatives, partial derivatives, directional derivatives
    • Chain rule, product rule, quotient rule
    • Jacobian matrix, Hessian matrix
    • Taylor series expansion
  • Integral Calculus
    • Definite/indefinite integrals
    • Fundamental theorem of calculus
    • Numerical integration (quadrature)
  • Multivariable Calculus
    • Gradient, divergence, curl
    • Lagrange multipliers, constrained optimization
    • Vector fields and flow
  • Optimization Theory
    • Convex vs. non-convex optimization
    • First and second-order optimality conditions
    • Saddle points, local vs. global minima
    • Lagrangian relaxation, KKT conditions

2.3 Probability & Statistics

  • Probability Theory
    • Probability spaces, sample spaces, events
    • Conditional probability, Bayes' theorem
    • Law of large numbers, central limit theorem
    • Moment generating functions
  • Probability Distributions
    • Discrete: Bernoulli, Binomial, Poisson, Categorical
    • Continuous: Gaussian, Uniform, Beta, Dirichlet, Laplace
    • Multivariate distributions, covariance matrices
  • Information Theory
    • Entropy, cross-entropy, joint entropy
    • Kullback-Leibler (KL) divergence
    • Mutual information, Jensen-Shannon divergence
    • Minimum description length
  • Statistical Estimation
    • Maximum likelihood estimation (MLE)
    • Maximum a posteriori (MAP)
    • Bayesian inference, prior/posterior
    • Expectation-Maximization (EM) algorithm
  • Sampling Methods
    • Monte Carlo sampling
    • Markov Chain Monte Carlo (MCMC)
    • Importance sampling
    • Temperature sampling, top-k, top-p (nucleus sampling)

2.4 Numerical Methods

  • Floating point arithmetic, precision issues (fp16, bf16, fp32)
  • Numerical stability, gradient clipping
  • Fast Fourier Transform (FFT)
  • Sparse matrix operations
  • Iterative solvers (conjugate gradient)

3. Machine Learning Fundamentals

3.1 Core Concepts

  • Supervised, Unsupervised, Semi-supervised, Self-supervised learning
  • Bias-variance tradeoff, overfitting, underfitting
  • Regularization: L1/L2, dropout, weight decay, early stopping
  • Cross-validation, hyperparameter tuning
  • Feature engineering, normalization, standardization

3.2 Classical Algorithms

  • Linear Regression, Logistic Regression
  • Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM)
  • Support Vector Machines (SVM), kernel trick
  • K-Nearest Neighbors (KNN)
  • Naive Bayes, Gaussian Mixture Models
  • PCA, t-SNE, UMAP (dimensionality reduction)
  • K-Means, DBSCAN, Hierarchical clustering

3.3 Gradient Descent & Optimizers

  • Vanilla Gradient Descent - full-batch, slow but stable
  • Stochastic Gradient Descent (SGD) - noisy but generalizes
  • Mini-batch SGD - industry standard balance
  • Momentum - exponential moving average of gradients
  • Nesterov Momentum - look-ahead momentum update
  • AdaGrad - per-parameter adaptive learning rate
  • RMSProp - decaying average of squared gradients
  • Adam - combines momentum + RMSProp
    • m_t = β1 * m_{t-1} + (1 - β1) * g_t
    • v_t = β2 * v_{t-1} + (1 - β2) * g_t²
    • θ = θ - α * m̂_t / (√v̂_t + ε)
  • AdamW - Adam with decoupled weight decay (preferred for LLMs; see the sketch after this list)
  • Lion - EvoLved Sign Momentum (Google, 2023)
  • Sophia - second-order optimizer for LLMs
  • LAMB/LARS - large-batch distributed training optimizers
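
A minimal AdamW update sketch for a single scalar parameter (illustrative only; real optimizers vectorize this across whole tensors):

import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    # Exponential moving averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay (the "W" in AdamW): applied to the weights directly
    theta -= lr * weight_decay * theta
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v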

3.4 Loss Functions

  • Mean Squared Error (MSE), Mean Absolute Error (MAE)
  • Cross-Entropy Loss (language modeling: next-token prediction)
  • Binary Cross-Entropy, Categorical Cross-Entropy
  • Contrastive Loss, Triplet Loss (for embeddings)
  • REINFORCE / Policy Gradient Loss (for RLHF)

4. Deep Learning Core

4.1 Neural Network Basics

  • Perceptron, Multi-layer Perceptron (MLP)
  • Activation functions:
    • ReLU: max(0, x) - dead neuron problem
    • GeLU: x * Φ(x) - smooth, used in GPT/BERT
    • SiLU/Swish: x * sigmoid(x) - Llama uses this
    • Mish, ELU, Leaky ReLU
    • Softmax: e^(x_i) / Σ e^(x_j) - for probability distributions
  • Backpropagation algorithm, automatic differentiation
  • Weight initialization: Xavier/Glorot, He initialization, normal/uniform

4.2 Normalization Techniques

  • Batch Normalization - normalizes across the batch dimension
    • μ_B = (1/m) Σ x_i;  σ²_B = (1/m) Σ (x_i - μ_B)²
    • Problems with small batches, sequential data
  • Layer Normalization - normalizes across the feature dimension
    • Used in all modern Transformers
    • LN(x) = (x - μ) / σ * γ + β
  • RMS Normalization (RMSNorm) - simplified LayerNorm
    • RMSNorm(x) = x / RMS(x) * γ; no mean subtraction
    • Used in Llama, Mistral - more efficient
  • Group Normalization - between BatchNorm and LayerNorm
  • Pre-Norm vs. Post-Norm - Pre-Norm (before attention) is more stable for deep networks

4.3 Regularization Deep Dive

  • Dropout - randomly zero out neurons during training
  • DropPath/Stochastic Depth - drop entire residual paths
  • Label Smoothing - soften hard labels to prevent overconfidence
  • Weight Decay (L2) - penalize large weights
  • Gradient Clipping - cap gradient norm to prevent explosion
  • Mixup / CutMix - data augmentation regularizers

4.4 CNN, RNN, LSTM (Pre-Transformer Context)

  • Convolutional Neural Networks - local feature extraction
  • Recurrent Neural Networks - sequential dependencies
    • Vanishing/exploding gradient problem
  • Long Short-Term Memory (LSTM) - gating mechanisms
  • Gated Recurrent Unit (GRU) - simplified LSTM
  • Seq2Seq with attention - foundation of Transformers
  • Encoder-Decoder architecture - original MT framework

5. Natural Language Processing (NLP)

5.1 Text Preprocessing

  • Tokenization strategies:
    • Word-level: simple but large vocabulary
    • Character-level: small vocab but long sequences
    • Subword: best of both worlds
  • Byte-Pair Encoding (BPE) - GPT-2, GPT-4 tokenizer (toy merge sketch after this list)
    • Start with characters, merge the most frequent pairs iteratively
    • Creates a vocabulary of ~32k-100k tokens
  • WordPiece - BERT tokenizer, similar to BPE
  • SentencePiece - language-agnostic; used by Llama, T5
    • Unigram language model variant
  • Tiktoken - OpenAI's fast tokenizer library
  • Stop words, stemming, lemmatization (less used with LLMs)
  • Text normalization: lowercasing, Unicode handling
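
A toy BPE training sketch in pure Python (illustrative only; real tokenizers such as tiktoken or HuggingFace tokenizers operate on bytes and are far faster):

from collections import Counter

def train_bpe(words, num_merges):
    # words: dict mapping a word (as a tuple of symbols) to its corpus frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Example: merges learned from a tiny corpus
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
print(train_bpe(corpus, 3))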

5.2 Word Embeddings (Pre-Transformer)

  • Word2Vec (2013, Google)
    • Skip-gram: predict context from center word
    • CBOW: predict center from context
    • Negative sampling optimization
  • GloVe - Global Vectors, co-occurrence matrix factorization
  • FastText - subword embeddings, handles OOV
  • ELMo - contextual embeddings from a bidirectional LSTM
  • Semantic similarity, analogy tasks (king - man + woman = queen)

5.3 Classic NLP Tasks (Now handled end-to-end by LLMs)

  • Named Entity Recognition (NER)
  • Part-of-Speech (POS) tagging
  • Sentiment Analysis
  • Machine Translation
  • Summarization
  • Question Answering
  • Text Classification

6. Transformer Architecture - The Heart of LLMs

6.1 Original Transformer ("Attention Is All You Need", 2017)

  • Input Embedding - token IDs → dense vectors (dim d_model)
  • Positional Encoding - add position info since there is no recurrence
    • Sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • Learnable positional embeddings (BERT, GPT)
  • Encoder Stack - bidirectional, used for understanding
  • Decoder Stack - autoregressive, used for generation
  • Cross-Attention - decoder attends to encoder outputs

6.2 Attention Mechanism β€” Complete Breakdown

Attention(Q, K, V) = softmax(QK^T / √d_k) * V

Where:
- Q = Query matrix (what we're looking for)
- K = Key matrix (what's available to match)
- V = Value matrix (what we actually retrieve)
- d_k = dimension of keys (scaling factor)
  • Self-Attention - Q, K, V all come from the same input
  • Cross-Attention - Q from decoder, K/V from encoder
  • Causal/Masked Attention - mask future tokens (GPT-style)
    • Lower-triangular mask: M_ij = -∞ if j > i, else 0
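
A minimal single-head causal attention sketch in PyTorch that follows the formula above (production code uses fused kernels such as FlashAttention):

import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (B, T, T)
    T = scores.size(-1)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))          # hide future tokens
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                        # (B, T, d_k)

q = k = v = torch.randn(1, 4, 8)
print(causal_attention(q, k, v).shape)   # torch.Size([1, 4, 8])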

6.3 Multi-Head Attention (MHA)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O

head_i = Attention(Q*W_Q_i, K*W_K_i, V*W_V_i)
  • Each head learns different aspects of relationships
  • Typical: 8, 12, 16, 32, 64 heads
  • Parallel computation, concatenate then project

6.4 Feed-Forward Network (FFN)

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

Or with GeLU:
FFN(x) = GeLU(xW_1 + b_1) * W_2 + b_2

SwiGLU variant (Llama):
FFN(x) = (SiLU(xW_1) * xW_3) * W_2
  • d_ff ≈ 4 * d_model (hidden expansion)
  • SwiGLU variants shrink this to d_ff ≈ (2/3) * 4 * d_model so the third weight matrix keeps the parameter count roughly constant

6.5 Residual Connections & Layer Norm

  • Residual Connection: x = x + Sublayer(LN(x))
  • Pre-norm (before sublayer) - better gradient flow
  • Post-norm (after sublayer) - original paper style
  • Why residuals: prevent vanishing gradients in deep networks

6.6 Positional Encoding Evolution

  • Absolute Positional Encoding - fixed sinusoidal (original)
  • Learnable Absolute PE - BERT, GPT-2
  • Relative Positional Encoding - Transformer-XL
    • Encode distance between tokens, not absolute positions
  • ALiBi (Attention with Linear Biases) - linear penalty
    • Add bias -m * |i-j| to attention scores
    • Better length generalization than sinusoidal
  • RoPE (Rotary Position Embedding) - GPT-NeoX, Llama
    • Rotate Q and K vectors by an angle proportional to position
    • In complex form: RoPE(x, pos) = x * e^(i * θ * pos)
    • Excellent length generalization, used in most SOTA models
    • YaRN/LongRoPE - extend the context of RoPE models
  • NoPE - no positional encoding, rely on attention patterns

6.7 Attention Variants & Optimizations

  • Multi-Query Attention (MQA) - single K, V shared across heads
    • Reduces KV cache size by a factor of num_heads (PaLM, Falcon)
  • Grouped Query Attention (GQA) - groups of heads share K, V
    • Balance between MHA quality and MQA efficiency (Llama 2/3)
  • Sliding Window Attention - each token attends to a local window
    • Mistral uses a 4096-token sliding window
  • Flash Attention - IO-aware exact attention algorithm
    • Tiles Q, K, V to fit in GPU SRAM
    • Never materializes the full N×N attention matrix
    • 2-4x speedup, O(N) memory instead of O(N²)
  • Flash Attention 2 & 3 - further optimizations for H100
  • PagedAttention - vLLM's memory-efficient KV cache paging
  • Ring Attention - distributes attention across devices for ultra-long sequences
  • Sparse Attention - attend to a subset of tokens (Longformer, BigBird)
  • Linear Attention - approximate attention in O(N) time

7. Large Language Model Internals

7.1 Model Families & Architectures

  • Encoder-Only (BERT family)
    • Bidirectional context, MLM pre-training
    • Best for: classification, NER, embeddings
    • Examples: BERT, RoBERTa, DeBERTa, ELECTRA
  • Decoder-Only (GPT family) ← Most modern LLMs
    • Causal/autoregressive, CLM pre-training
    • Best for: generation, chat, reasoning
    • Examples: GPT-4, Claude, Llama, Mistral, Gemini
  • Encoder-Decoder (T5/Seq2Seq family)
    • Encoder reads input, decoder generates output
    • Best for: translation, summarization with source
    • Examples: T5, FLAN-T5, BART, mBART

7.2 Scaling Laws

  • Chinchilla Scaling Laws (Hoffmann et al., 2022)
    • Optimal: training tokens ≈ 20 × number of parameters
    • Compute-optimal N and D both scale roughly as C^0.5
    • Training compute: C ≈ 6ND FLOPs (N params, D tokens) - see the sketch after this list
    • GPT-3 was undertrained; Chinchilla used the same compute on more tokens
  • OpenAI Scaling Laws (Kaplan et al., 2020)
    • Loss scales as a power law with compute, data, and parameters
    • L(N) ~ N^{-0.076}; L(D) ~ D^{-0.095}
  • Emergent Abilities - appear suddenly at certain scales
    • In-context learning (~1B+), chain-of-thought (~100B+)
  • Neural Scaling Laws for LLM Inference
    • Larger model + fewer inference steps > smaller model + more steps
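
A back-of-envelope sketch of this budgeting, assuming C ≈ 6ND and the ~20 tokens-per-parameter rule:

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    # C = 6*N*D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.8e23 FLOP budget (roughly Chinchilla-scale)
n, d = chinchilla_split(5.8e23)
print(f"~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")   # ~70B params, ~1.4T tokens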

7.3 Context Window & Memory

  • Context window - maximum tokens the model can process
  • KV Cache - cache Key and Value tensors during generation
    • Memory: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element (see the sketch after this list)
    • For Llama-3-70B at 100K context: ~30GB just for the KV cache
  • KV Cache Compression
    • StreamingLLM - sink tokens + recent window
    • SnapKV - select important KV pairs
    • MLA (Multi-head Latent Attention) - DeepSeek's innovation
  • Positional interpolation - extend context beyond training length
  • Infini-attention - compressive memory for infinite context
  • Mamba/SSM - linear recurrence, O(1) memory per step
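
A quick sanity check of the KV cache formula above (a sketch; the Llama-3-70B-style numbers of 80 layers, 8 KV heads, and head_dim 128 are assumed for illustration):

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # Factor of 2 covers storing both K and V
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=100_000) / 1e9
print(f"{gb:.0f} GB")   # ~33 GB for one 100K-token sequence in BF16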

7.4 Tokenizer Design Details

  • Vocabulary size tradeoffs
    • Larger vocab: shorter sequences, faster, but larger embedding table
    • Smaller vocab: longer sequences, slower, smaller model
    • Typical: 32K (Llama 2), 128K (Llama 3, GPT-4), 256K (Gemini)
  • Special tokens: [BOS], [EOS], [PAD], [UNK], [MASK]
  • Chat templates: system/user/assistant turn formatting
    • Llama: <|begin_of_text|><|start_header_id|>system<|end_header_id|>...
    • ChatML: <|im_start|>system\n...<|im_end|>
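
For concreteness, a sketch of applying a chat template with the HuggingFace transformers API (the model name is only an example and may require access approval):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV caching in one sentence."},
]
# Renders the special tokens and turn headers described above
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)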

7.5 Mixture of Experts (MoE)

  • Replace dense FFN with N expert FFNs + router
  • Top-K Routing: only K experts activated per token (K=1 or 2)
  • Load Balancing Loss: encourage equal use of all experts
  • Sparse MoE: Mixtral 8x7B, 8x22B - 8 experts, 2 active
  • Fine-grained MoE: DeepSeek-V3 - 256 experts, 8 active
  • Expert Choice routing - experts choose tokens (better balance)
  • Advantages: massive parameter count, same compute cost

8. Training Pipeline - From Scratch to Advanced

8.1 Data Collection & Curation

Sources

  • Common Crawl - petabyte-scale web crawl (CC-Main, CC-News)
  • The Pile - EleutherAI's 825GB diverse dataset
  • RedPajama - open reproduction of LLaMA training data
  • ROOTS - multilingual BLOOM training data
  • Books: Project Gutenberg, Books3, BookCorpus
  • Code: GitHub (The Stack, StarCoder data), code contests
  • Scientific Papers: arXiv, PubMed, S2ORC
  • Wikipedia/Wikidata - high-quality factual text
  • StackExchange, Reddit - Q&A, discussion
  • Multilingual: CC-100, mC4, CulturaX

Data Processing Pipeline

Raw HTML/Text
    ↓
URL/Domain Filtering (dedup, quality domains)
    ↓
Language Identification (fastText, langdetect)
    ↓
Quality Filtering:
  - Perplexity filter (KenLM)
  - Heuristics (short docs, repetition ratio, symbol ratio)
  - ML classifiers (CCNet, Gopher quality filters)
    ↓
Deduplication:
  - Exact: MD5/SHA256 hashing
  - Near-duplicate: MinHash LSH (SimHash)
  - Semantic: embedding-based dedup
    ↓
PII Removal (emails, phone numbers, SSNs)
    ↓
Tokenization & Packing
    ↓
Binary format (numpy memmap, HDF5, WebDataset)

Data Mixing & Weighting

  • Domain weighting: upweight high-quality sources
  • Data mixing ratios (e.g., 80% web, 10% code, 5% books, 5% science)
  • Data flywheels: use trained model to filter better data
  • DSIR (Data Selection via Importance Resampling) - target-aware
  • DoReMi - automatic domain weight optimization

8.2 Model Architecture Configuration

Hyperparameter Selection Table

Model Size | d_model | n_layers | n_heads | d_ff    | Params
-----------|---------|----------|---------|---------|--------
125M       | 768     | 12       | 12      | 3072    | ~125M
1.3B       | 2048    | 24       | 16      | 8192    | ~1.3B
7B         | 4096    | 32       | 32      | 11008   | ~7B
13B        | 5120    | 40       | 40      | 13824   | ~13B
30B        | 6656    | 60       | 52      | 17920   | ~30B
70B        | 8192    | 80       | 64      | 28672   | ~70B
175B(GPT3) | 12288   | 96       | 96      | 49152   | ~175B
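
As a rough cross-check of the table, a parameter-count sketch for a dense decoder-only model (ignores biases, norms, and GQA; assumes tied embeddings):

def count_params(vocab, d_model, n_layers, d_ff, swiglu=True):
    embed = vocab * d_model                          # tied input/output embedding
    attn = 4 * d_model * d_model                     # W_Q, W_K, W_V, W_O
    ffn = (3 if swiglu else 2) * d_model * d_ff      # up/gate/down vs. up/down
    return embed + n_layers * (attn + ffn)

# The "7B" row above (32K vocab assumed)
print(f"{count_params(32_000, 4096, 32, 11008) / 1e9:.2f}B")
# ~6.6B; real 7B models add untied embeddings, norms, etc.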

8.3 Pre-Training

Causal Language Modeling Objective

L_CLM = -Σ log P(x_t | x_1, ..., x_{t-1})

For each sequence: predict next token given all previous tokens
Cross-entropy loss averaged over all positions

Training Configuration

  • Batch size: typically 256–4096 sequences
  • Sequence length: 2048–8192 tokens per sequence
  • Global batch size: micro_batch × grad_accum × world_size
  • Learning rate schedule:
    • Linear warmup (1000–2000 steps)
    • Cosine decay to lr_min = 0.1 × lr_max
    • lr_max typically 1e-4 to 3e-4
  • Weight decay: 0.1 (AdamW standard)
  • Gradient clipping: clip at 1.0 norm
  • β1=0.9, β2=0.95, ε=1e-8 (Adam hyperparameters for LLMs)

Training Stability Techniques

  • Loss spikes: reduce lr, check data quality at spike step
  • Gradient norm monitoring: track throughout training
  • Loss divergence recovery: reload checkpoint, skip data batch
  • Z-loss regularization: penalize large logit magnitudes
  • QK Norm: normalize Q and K before attention score computation
  • Checkpoint averaging: average last N checkpoints for stability

8.4 Distributed Training

Data Parallelism (DP)

  • Replicate model on each GPU
  • Each GPU processes different batch
  • Synchronize gradients via AllReduce after backward
  • DDP (PyTorch), Horovod
  • FSDP (Fully Sharded Data Parallel) - ZeRO Stage 3

ZeRO (Zero Redundancy Optimizer) - DeepSpeed

Stage 0: Baseline DDP (model replicated)
Stage 1: Shard optimizer states across GPUs
Stage 2: Shard optimizer states + gradients
Stage 3: Shard optimizer states + gradients + parameters
         (full model sharding - needed for 70B+ on 8 GPUs)
ZeRO-Infinity: offload to CPU/NVMe for extreme scale

Tensor Parallelism (TP) - Megatron-LM

  • Split individual weight matrices across GPUs
  • Column parallel: split W along output dimension
  • Row parallel: split W along input dimension
  • Requires AllReduce at each forward/backward
  • Best for very large layers (d_model = 8192+)

Pipeline Parallelism (PP)

  • Assign layers to different GPUs/nodes
  • GPipe: micro-batches flow through pipeline
  • PipeDream: 1F1B (one forward, one backward) schedule
  • Bubble overhead: (p-1)/(m+p-1) for p stages, m micro-batches
  • Interleaved pipeline: reduces bubble, increases memory

Sequence Parallelism (SP)

  • Distribute long sequence across devices
  • Each device handles chunk of sequence length
  • Ring Attention: pass KV around ring of devices
  • Useful for 100K+ context training

3D Parallelism (Megatron-DeepSpeed)

  • Combine DP + TP + PP for training 100B+ models
  • Example: 175B on 1024 GPUs: DP=8, TP=8, PP=16

8.5 Mixed Precision Training

  • FP32 - full precision, safe but 2× memory vs FP16
  • FP16 - 5-bit exponent, 10-bit mantissa, can overflow
  • BF16 - 8-bit exponent, 7-bit mantissa, same range as FP32
    • Preferred for LLM training (no loss scaling needed)
  • AMP (Automatic Mixed Precision):
    • Keep master weights in FP32
    • Forward/backward in FP16/BF16
    • Update master FP32 weights
  • FP8 Training - H100 native, needs careful scaling
    • Transformer Engine (NVIDIA) handles FP8 automatically

8.6 Checkpointing & Recovery

  • Save every N steps (N = 500–2000 typically)
  • Checkpoint includes: model weights, optimizer states, scheduler state, RNG state
  • Activation Checkpointing - recompute activations during backward to save memory
    • Trade ~33% extra compute for ~10× memory savings
  • Selective Activation Checkpointing - checkpoint only expensive ops
  • Distributed checkpoint sharding (each rank saves its own shard)

9. RLHF, Alignment & Fine-Tuning

9.1 Supervised Fine-Tuning (SFT)

  • Collect instruction-response pairs (hundreds of thousands)
  • Data formats:
    • Alpaca format: instruction/input/output
    • ShareGPT format: multi-turn conversations
    • FLAN/T0: task-specific instruction templates
  • Fine-tune with teacher forcing on the completions only
  • Mask the loss on prompt tokens; compute it only on the response (see the sketch after this list)
  • Key datasets: OpenAssistant, Dolly, FLAN, WizardLM, UltraChat
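
A minimal sketch of that loss masking, using PyTorch's ignore_index convention:

import torch
import torch.nn.functional as F

def sft_labels(input_ids, prompt_len):
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100     # positions with label -100 contribute no loss
    return labels

def sft_loss(logits, labels):
    # Standard next-token shift: predict token t+1 from the hidden state at position t
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )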

9.2 RLHF Pipeline (Reinforcement Learning from Human Feedback)

Step 1: SFT Model (instruction-following base)
    ↓
Step 2: Preference Data Collection
  Human annotators compare 2+ model outputs
  Rank: A > B or A = B
  Collect ~50K–1M comparisons
    ↓
Step 3: Reward Model Training
  Bradley-Terry model:
  L_RM = -E[log σ(r(x, y_w) - r(x, y_l))]
  where y_w = preferred response, y_l = rejected
    ↓
Step 4: PPO Training
  Maximize: E[r(x, y)] - β * KL(π_θ || π_ref)
  KL penalty prevents the model from deviating too far from SFT

PPO (Proximal Policy Optimization) Details

L_CLIP = E[min(r_t(θ) * A_t, clip(r_t(θ), 1-ε, 1+ε) * A_t)]

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)  (probability ratio)
A_t = advantage estimate (GAE: Generalized Advantage Estimation)
ε = 0.2 (clipping parameter)

Value function:
L_VF = E[(V_θ(s_t) - V_target)²]

Total loss:
L = L_CLIP - c1 * L_VF + c2 * S[π_θ](s_t)  (entropy bonus)

9.3 DPO (Direct Preference Optimization) - Simpler RLHF

L_DPO = -E[log σ(β * (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]

Advantages over RLHF-PPO:
- No separate reward model needed
- More stable training
- Simpler implementation
- Comparable or better results
  • Variants: IPO, KTO, ORPO, SimPO, CPO
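
The DPO objective above, as a minimal PyTorch sketch given the summed log-probabilities of each response under the policy and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()   # -log sigmoid, averaged over the batch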

9.4 Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation)

W = W_0 + ΔW = W_0 + BA

Where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)
Typically r = 4, 8, 16, 32, 64

Number of trainable params: r*(d+k) vs d*k
Reduction factor: (r*(d+k)) / (d*k)
Example: 4096×4096 layer, r=16: 16*(4096+4096) ≈ 131K trainable vs 16.8M (~99.2% reduction)
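
A minimal LoRA layer sketch: the base weight is frozen and only the low-rank update B @ A is trained (alpha/r scaling as in the original paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze W_0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # B = 0 so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
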
  • QLoRA - quantize the base model to 4-bit, train LoRA adapters
    • NF4 quantization (Normal Float 4-bit)
    • Double quantization: quantize the quantization constants too
    • Paged Optimizers: offload optimizer states to CPU RAM
  • LoRA+ - different learning rates for the A and B matrices
  • DoRA - decompose into magnitude + direction components
  • LoRA-FA - frozen A matrix, only train B
  • AdaLoRA - adaptive rank allocation per layer
  • PiSSA - principal singular values and singular vectors
  • Prefix Tuning - trainable prefix tokens prepended to each layer
  • P-Tuning v2 - deep prompt tuning
  • IA³ - rescale activations with learned vectors (<0.1% params)

9.5 Constitutional AI (Anthropic's Claude Approach)

  • Define principles/constitution for model behavior
  • CAI-SL: fine-tune on critiques and revisions following constitution
  • CAI-RL: use AI feedback instead of human (RLAIF)
    • Generate responses → evaluate with the constitution → rank → RL
  • Red-teaming: adversarial probing for harmful outputs
  • Harmlessness + Helpfulness + Honesty triad

9.6 Model Merging

  • SLERP - spherical linear interpolation of model weights
  • TIES-Merging - trim + elect sign + disjoint merge
  • DARE - random pruning before merging (sparse delta weights)
  • Model Soup - average fine-tuned models (Wortsman et al.)
  • Mergekit library for practical model merging

10. Major Algorithms & Techniques Reference

10.1 Generation Algorithms

  • Greedy Decoding - always pick the highest-probability token
  • Beam Search - maintain the top-B hypotheses at each step
  • Sampling - sample from the probability distribution
  • Temperature Scaling - T < 1 = sharper; T > 1 = softer
    • P'(x) = softmax(logits / T)
  • Top-K Sampling - sample from the top K tokens only
  • Top-P (Nucleus) Sampling - sample from tokens whose probabilities sum to P (see the sketch after this list)
  • Min-P Sampling - minimum probability threshold relative to the top token
  • Typical Sampling - sample tokens of typical information content
  • Contrastive Search - maximize (1-α)*p(x) - α*max_j cos_sim(h_x, h_xj)
  • Speculative Decoding - a small draft model proposes, the large model verifies
    • ~2-4× speedup with no quality loss
  • Medusa - parallel draft heads on a single model
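
A sketch of temperature, top-k, and top-p sampling from raw logits (illustrative; inference engines implement fused versions of this):

import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    logits = logits / temperature
    # Top-k: keep only the k highest logits
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    # Top-p: drop the tail once cumulative probability exceeds p (always keep the top token)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)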

10.2 Inference Optimization

  • KV Cache - store past key-value pairs, O(1) per new token
  • Continuous Batching - dynamic batching, no waiting for sequences to finish
  • PagedAttention - virtual memory for the KV cache
  • Quantization:
    • PTQ (Post-Training Quantization): GPTQ, AWQ, SmoothQuant
    • QAT (Quantization-Aware Training): fp8, int8 training
    • W4A16: 4-bit weights, 16-bit activations (most common)
    • GGUF format: llama.cpp quantization (Q4_K_M, Q5_K_S, etc.)
  • Weight Sharing/Tying - tie embedding and output projection
  • Knowledge Distillation - small student learns from a large teacher
    • Response distillation: match output distributions
    • Feature distillation: match intermediate representations
  • Pruning:
    • Magnitude pruning: remove small-weight connections
    • Structured pruning: remove entire heads/layers
    • SparseGPT: one-shot unstructured pruning for GPT models

10.3 Long Context Techniques

  • Sliding Window Attention - Mistral's local attention
  • LongFormer - local + global attention tokens
  • BigBird - random + local + global attention
  • ALiBi - linear bias enables zero-shot length generalization
  • RoPE scaling variants:
    • Linear interpolation (position_id / scale_factor)
    • NTK-aware interpolation
    • YaRN (Yet Another RoPE Extension)
    • LongRoPE (progressive rescaling)
  • Retrieval Augmented Generation (RAG):
    • Dense retrieval (DPR, E5, BGE embeddings)
    • Sparse retrieval (BM25)
    • Hybrid retrieval
    • Reranking (ColBERT, cross-encoder)

10.4 Reasoning & Chain of Thought

  • Chain-of-Thought (CoT) prompting - "Let's think step by step"
  • Self-Consistency - sample multiple CoT paths, majority vote
  • Tree of Thought (ToT) - tree search over reasoning steps
  • Graph of Thought - arbitrary DAG of reasoning
  • Program-Aided Language Models (PAL) - generate executable code
  • ReAct - interleaved reasoning and action (tool use)
  • Process Reward Models (PRM) - reward each reasoning step
  • Outcome Reward Models (ORM) - reward the final answer only
  • MCTS for LLM reasoning - Monte Carlo Tree Search guided by a PRM
  • o1/o3-style reasoning - long chain-of-thought with test-time compute scaling

11. Tools, Frameworks & Libraries

11.1 Deep Learning Frameworks

Framework    | Use Case               | Key Feature
-------------|------------------------|-----------------------------
PyTorch      | Research & production  | Dynamic graphs, pythonic
JAX          | Google TPU training    | XLA compilation, functional
TensorFlow   | Production deployment  | TF Serving, TFLite
MXNet        | AWS ecosystem          | Gluon API
PaddlePaddle | Baidu ecosystem        | Chinese NLP focus

11.2 LLM Training Frameworks

Framework     | Organization | Best For
--------------|--------------|----------------------------------
Megatron-LM   | NVIDIA       | Large-scale 3D parallel training
DeepSpeed     | Microsoft    | ZeRO optimization, ZeRO-Infinity
FSDP          | Meta/PyTorch | Simpler full sharding
Colossal-AI   | HPC-AI Tech  | Heterogeneous training
Alpa          | UCB/Google   | Auto-parallelism
LLaMA-Factory | Community    | Fine-tuning factory
Axolotl       | OpenAccess   | YAML-configured fine-tuning
TRL           | HuggingFace  | RLHF/DPO training
OpenRLHF      | OpenLLMAI    | Scalable RLHF
Nanotron      | HuggingFace  | Lightweight pre-training

11.3 Inference Frameworks

Framework                       | Focus                 | Key Feature
--------------------------------|-----------------------|-------------------------------------
vLLM                            | High throughput       | PagedAttention, continuous batching
TGI (Text Generation Inference) | HuggingFace           | Production API
llama.cpp                       | Local/edge            | CPU inference, GGUF
Ollama                          | Local deployment      | Easy model management
TensorRT-LLM                    | NVIDIA GPU            | TensorRT optimized kernels
MLC-LLM                         | Multi-platform        | Web, mobile, server
ExLlamaV2                       | Consumer GPU          | GPTQ inference
CTransformers                   | Python bindings       | llama.cpp Python
LightLLM                        | Triton kernels        | FlashAttention2
SGLang                          | Structured generation | RadixAttention

11.4 Model Hub & Ecosystem

  • HuggingFace Hub - 500K+ models, datasets, spaces
    • Transformers library: universal model API
    • Datasets library: 50K+ datasets
    • PEFT library: LoRA, prefix tuning, etc.
    • Accelerate: multi-GPU/TPU training
    • Tokenizers: fast Rust tokenizers
  • PyTorch Hub - model repository
  • Weights & Biases (WandB) - experiment tracking
  • MLflow - experiment tracking + model registry
  • DVC - data version control
  • LangChain - LLM application framework
  • LlamaIndex - RAG and data indexing
  • Haystack - NLP pipeline framework

11.5 Data Processing

  • Apache Spark - distributed data processing
  • Ray - distributed Python, Ray Data
  • DataTrove - HuggingFace data processing pipeline
  • The Stack deduplication - MinHash LSH at scale
  • SentenceTransformers - embedding models
  • FAISS - fast ANN search for vectors
  • Elasticsearch - BM25 + vector search

11.6 Evaluation Frameworks

Benchmark             | Tests                        | Size
----------------------|------------------------------|----------------
MMLU                  | World knowledge, 57 subjects | 14K questions
HellaSwag             | Commonsense reasoning        | 70K examples
HumanEval             | Code generation              | 164 problems
MBPP                  | Python programming           | 500 problems
GSM8K                 | Grade school math            | 8.5K problems
MATH                  | Competition math             | 12.5K problems
ARC-Challenge         | Science QA                   | 1.2K questions
TruthfulQA            | Factual accuracy             | 817 questions
BIG-Bench             | Diverse reasoning            | 204 tasks
MT-Bench              | Chat multi-turn              | 80 questions
Chatbot Arena         | Human preference             | 1M+ votes
lm-evaluation-harness | EleutherAI eval framework    | All benchmarks

12. Hardware Requirements by Model Type

12.1 GPU Reference Table

Consumer GPUs

GPU      | VRAM | Memory BW | TFLOPs (BF16) | Best Use
---------|------|-----------|---------------|--------------------------
RTX 3080 | 10GB | 760 GB/s  | 29.8          | Inference ≤7B
RTX 3090 | 24GB | 936 GB/s  | 35.6          | Inference ≤13B
RTX 4080 | 16GB | 736 GB/s  | 48.7          | Inference ≤13B
RTX 4090 | 24GB | 1008 GB/s | 82.6          | Inference/fine-tune ≤33B
RTX 5090 | 32GB | 1792 GB/s | 209           | Inference/fine-tune ≤70B

Data Center GPUs

GPU       | VRAM  | Memory BW | TFLOPs (BF16) | NVLink | Best Use
----------|-------|-----------|---------------|--------|----------------
A100 40GB | 40GB  | 1.6 TB/s  | 312           | Yes    | Training ≤13B
A100 80GB | 80GB  | 2.0 TB/s  | 312           | Yes    | Training ≤30B
H100 SXM  | 80GB  | 3.35 TB/s | 989           | Yes    | Training ≤70B
H100 NVL  | 94GB  | 3.9 TB/s  | 1979          | Yes    | Large models
H200      | 141GB | 4.8 TB/s  | 1979          | Yes    | 70B+ training
B200      | 192GB | 8 TB/s    | ~4500         | Yes    | Frontier models

Multi-GPU Requirements for Training

Model Size | FP16/BF16 Memory | 8×A100 40GB | 8×A100 80GB | 8×H100
-----------|------------------|-------------|-------------|--------
7B params  | ~14GB (weights)  | Yes         | Yes         | Yes
13B        | ~26GB            | With ZeRO3  | Yes         | Yes
30B        | ~60GB            | With ZeRO3  | With ZeRO3  | Yes
70B        | ~140GB           | No          | With ZeRO3  | Yes
175B       | ~350GB           | No          | 4 nodes     | 2 nodes
405B       | ~810GB           | No          | No          | 4+ nodes

12.2 Memory Math

Model Parameters Memory (bytes):
- FP32: 4 bytes/param
- BF16/FP16: 2 bytes/param
- INT8: 1 byte/param
- INT4/NF4: 0.5 bytes/param

Training Memory = model + gradients + optimizer states
  - SGD: model + gradients = 2× model
  - Adam/AdamW: model + gradients + 2× optimizer = 4× model
  - AMP: model(fp16) + model(fp32 master) + gradients + optimizer
         = 2 + 4 + 2 + 8 = 16 bytes/param

Inference Memory = model + KV_cache + activations
  KV cache = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes

Example: 7B model training with AdamW in BF16:
  7B × 16 bytes = 112GB → needs 2× A100 80GB with ZeRO
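
The 16 bytes/param rule above as a tiny sketch, including the effect of sharding states across GPUs with ZeRO-3:

def training_memory_per_gpu_gb(n_params, n_gpus=1, bytes_per_param=16):
    # params + grads + Adam states; ZeRO-3 shards all of them across GPUs
    return n_params * bytes_per_param / n_gpus / 1e9

print(training_memory_per_gpu_gb(7e9))             # ~112 GB on one GPU
print(training_memory_per_gpu_gb(7e9, n_gpus=8))   # ~14 GB/GPU with ZeRO-3 (activations extra)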

12.3 TPU (Google)

  • TPU v4: 275 TFLOPS BF16, 32GB HBM, 600 GB/s
  • TPU v5e: purpose-built inference, 4× efficiency vs v4
  • TPU v5p: training powerhouse, 459 TFLOPS
  • Cloud TPU Pods: 4096 chips interconnected (exaFLOP scale)
  • Native JAX/XLA support; PyTorch via torch_xla

12.4 Infrastructure

  • Networking: InfiniBand (400 Gb/s) between nodes, NVLink within node
  • Storage: Lustre parallel filesystem, AWS FSx, GCS
    • Read bandwidth: 1-100 GB/s for efficient data loading
  • CPU: AMD EPYC/Intel Xeon for data preprocessing
  • RAM: 512GB–2TB per node for large batch, ZeRO-Infinity offload
  • NVMe: 30+ TB fast local storage for checkpoint/cache
  • Power: H100 SXM: 700W; full node (8× H100): ~10kW

13. Architecture Designs - Working Principles

13.1 Complete Transformer Forward Pass (Decoder-Only)

INPUT: Token sequence [t_1, t_2, ..., t_n]

STEP 1: Embedding
  x = Embedding(tokens) + PositionalEncoding(positions)
  x ∈ R^(n × d_model)

STEP 2: For each of L transformer layers:

  a) Layer Norm (Pre-Norm)
     x_norm = LN(x)   or   RMSNorm(x)
  
  b) Causal Self-Attention
     Q = x_norm @ W_Q    K = x_norm @ W_K    V = x_norm @ W_V
     
     Apply RoPE to Q and K:
     Q, K = apply_rotary_embedding(Q, K, positions)
     
     Split into h attention heads
     
     For each head i:
        A_i = softmax((Q_i @ K_i^T) / √d_k + causal_mask) @ V_i
     
     Concatenate: A = [A_1; A_2; ...; A_h]
     Output: Attn_out = A @ W_O
     
     Residual: x = x + Attn_out
  
  c) Layer Norm (Pre-Norm again)
     x_norm2 = LN(x)   or   RMSNorm(x)
  
  d) Feed-Forward Network (SwiGLU)
     gate = SiLU(x_norm2 @ W_gate)
     up   = x_norm2 @ W_up
     FFN  = (gate * up) @ W_down
     
     Residual: x = x + FFN

STEP 3: Final Layer Norm
  x = RMSNorm(x)

STEP 4: Language Model Head (Linear projection + Softmax)
  logits = x @ W_lm_head     (or use tied embedding weights)
  probs  = softmax(logits / temperature)

STEP 5: Sample/select next token
  next_token = sample(probs)  or  argmax(probs)

13.2 GPT Architecture (Decoder-Only)

  • Unidirectional attention (causal mask)
  • Predict next token: P(x_t | x_1...x_{t-1})
  • GPT-1: 12 layers, 768 d_model, 117M params
  • GPT-2: 48 layers, 1600 d_model, 1.5B params
  • GPT-3: 96 layers, 12288 d_model, 175B params
  • GPT-4: MoE, ~8×220B, estimated ~1.8T params (unconfirmed)
  • Training: Causal LM, web text
  • No official architecture paper for GPT-4

13.3 Llama Architecture Details

  • Llama 1/2: RoPE, RMSNorm, SwiGLU FFN, GQA (Llama 2)
  • Llama 3: 128K vocab, GQA, 128K context
  • Architecture difference from GPT:
    • No biases in linear layers
    • RMSNorm instead of LayerNorm (no mean subtraction)
    • RoPE instead of absolute PE
    • SwiGLU with 3 matrices (up, gate, down) instead of 2
    • GQA: fewer KV heads than query heads

13.4 BERT Architecture (Encoder-Only)

  • Bidirectional attention (all tokens attend to all)
  • Pre-training objectives:
    • MLM: mask 15% tokens, predict them
    • NSP: predict if sentence B follows sentence A
  • Fine-tune on downstream tasks with task head
  • CLS token embedding → classification

13.5 T5 Architecture (Encoder-Decoder)

  • Encoder: bidirectional full attention
  • Decoder: causal self-attention + cross-attention to encoder
  • All tasks as text-to-text: "Translate: ..." → "..."
  • Relative positional biases instead of absolute PE

13.6 Mamba / SSM Architecture (Alternative to Transformer)

State Space Model Core:
h'(t) = A * h(t) + B * x(t)
y(t) = C * h(t) + D * x(t)

Discretized:
h_t = Ā * h_{t-1} + B̄ * x_t
y_t = C * h_t

Mamba adds: selective scan mechanism (S4 + selectivity)
Δ, B, C are input-dependent (unlike fixed SSMs); a minimal scan sketch follows the list below

Advantages:
- Linear O(N) training time
- O(1) memory per inference step
- Competitive with Transformers on long sequences
  • Mamba 2 - parallel scan, SSD (Structured State Space Duality)
  • Jamba - interleaved Mamba + Transformer layers (AI21)
  • Falcon Mamba - pure SSM language model
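
A minimal sequential scan sketch of the discretized recurrence above (Mamba makes Δ, B, C input-dependent and replaces this loop with a hardware-aware parallel scan):

import torch

def ssm_scan(x, A_bar, B_bar, C):
    # x: (T,) scalar input channel; A_bar: (N, N); B_bar: (N,); C: (N,)
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A_bar @ h + B_bar * x[t]     # h_t = Ā h_{t-1} + B̄ x_t
        ys.append(C @ h)                 # y_t = C h_t
    return torch.stack(ys)

y = ssm_scan(torch.randn(10), torch.eye(4) * 0.9, torch.ones(4), torch.ones(4))
print(y.shape)   # torch.Size([10])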

13.7 Mixture of Experts (MoE) Architecture

Expert Router:
  g(x) = Softmax(TopK(x @ W_router))
  
Each token routes to K experts:
  output = Σ g_k(x) * Expert_k(x)

Load Balancing:
  L_aux = α * Σ_i f_i * P_i
  f_i = fraction of tokens routed to expert i
  P_i = fraction of router probability on expert i
  • Mixtral 8×7B: 8 FFN experts, 2 active; effectively 12.9B active params (see the routing sketch after this list)
  • DeepSeek-V3: 256 experts, 8 active + 1 shared expert always active
  • Switch Transformer: top-1 routing, simpler but less expressive
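
A minimal top-2 routing sketch following the equations above (experts as small MLPs; the gate is a softmax over the selected experts' router logits):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)                 # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # gate over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=256)
print(moe(torch.randn(16, 64)).shape)   # torch.Size([16, 64])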

14. Complete Design & Development Process

14.1 Phase 0: Problem Definition & Scoping (Weeks 1-2)

Decision Framework:
├── Model Purpose
│   ├── General assistant (broad capability)
│   ├── Domain specialist (legal, medical, code)
│   ├── Multilingual (cover which languages?)
│   └── Multimodal (text + image + audio?)
├── Scale Decision
│   ├── <1B: edge deployment, specialized tasks
│   ├── 1-7B: consumer hardware, good balance
│   ├── 7-70B: server deployment, high capability
│   └── 70B+: frontier capability, data center
├── Compute Budget
│   ├── GPU-hours × cost → max tokens you can train
│   ├── Use Chinchilla formula for optimal allocation
│   └── Factor in inference cost at scale
└── Success Metrics
    ├── Benchmark targets (MMLU, HumanEval, etc.)
    ├── Latency requirements (tokens/sec)
    └── Cost per token at serving scale

14.2 Phase 1: Data Pipeline (Months 1-2)

Step 1: Data Acquisition
  - Download Common Crawl (use cc-net or datatrove)
  - Acquire books, code, scientific papers
  - License check everything

Step 2: Setup Processing Infrastructure
  pip install datatrove apache-beam
  # Spark cluster or Ray cluster for scale

Step 3: Implement Quality Pipeline
  quality_pipeline = [
    URLFilter(block_list=ADULT_DOMAINS),
    LanguageFilter(languages=["en"], min_prob=0.65),
    GopherQualityFilter(min_words=50, max_ratio_bullet_lines=0.9),
    C4QualityFilter(),
    ParagraphFilter(min_paragraphs=3),
  ]

Step 4: Deduplication
  minhash_dedup = MinHashDedup(
      n_shingles=5,
      n_buckets=14,
      n_hashes_per_bucket=8,
      threshold=0.7
  )

Step 5: Tokenize & Pack
  # BPE tokenizer training
  tokenizer = Tokenizer(BPE(unk_token="<unk>"))
  tokenizer.train(files=corpus_files, vocab_size=32000)
  
  # Pack sequences to max_length, add BOS/EOS
  # Use numpy memmap for efficient storage

14.3 Phase 2: Model Implementation (Month 2-3)

# Complete minimal Llama-style transformer implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8          # GQA
    max_seq_len: int = 4096
    ffn_dim: int = 14336
    rms_norm_eps: float = 1e-5

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        norm = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * norm * self.weight

def precompute_freqs(dim, max_len, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_len)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis

def apply_rotary_emb(xq, xk, freqs_cis):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[:xq.shape[1]].unsqueeze(0).unsqueeze(2)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads
        self.head_dim = config.d_model // config.n_heads
        self.n_rep = self.n_heads // self.n_kv_heads
        
        self.wq = nn.Linear(config.d_model, config.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(config.n_heads * self.head_dim, config.d_model, bias=False)
    
    def forward(self, x, freqs_cis, mask=None):
        B, T, _ = x.shape
        xq = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        xk = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        xv = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        
        xq, xk = apply_rotary_emb(xq, xk, freqs_cis)
        
        # Expand KV for GQA
        xk = xk.repeat_interleave(self.n_rep, dim=2)
        xv = xv.repeat_interleave(self.n_rep, dim=2)
        
        # Flash Attention via scaled_dot_product_attention
        xq = xq.transpose(1, 2)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)
        
        # Fused attention kernel; PyTorch forbids passing both an explicit mask and is_causal=True
        out = F.scaled_dot_product_attention(xq, xk, xv,
                                             attn_mask=mask,
                                             is_causal=mask is None)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)
        return self.wo(out)

class SwiGLU(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.w1 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
        self.w2 = nn.Linear(config.ffn_dim, config.d_model, bias=False)
        self.w3 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = Attention(config)
        self.feed_forward = SwiGLU(config)
        self.attention_norm = RMSNorm(config.d_model, config.rms_norm_eps)
        self.ffn_norm = RMSNorm(config.d_model, config.rms_norm_eps)
    
    def forward(self, x, freqs_cis, mask=None):
        x = x + self.attention(self.attention_norm(x), freqs_cis, mask)
        x = x + self.feed_forward(self.ffn_norm(x))
        return x

class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = nn.Embedding(config.vocab_size, config.d_model)
        self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
        self.norm = RMSNorm(config.d_model, config.rms_norm_eps)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        
        # Tie weights
        self.lm_head.weight = self.embeddings.weight
        
        # Precompute RoPE frequencies
        self.freqs_cis = precompute_freqs(config.d_model // config.n_heads, config.max_seq_len)
    
    def forward(self, tokens, targets=None):
        B, T = tokens.shape
        x = self.embeddings(tokens)
        freqs_cis = self.freqs_cis[:T].to(x.device)
        
        for layer in self.layers:
            x = layer(x, freqs_cis)
        
        x = self.norm(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        
        return logits, loss

14.4 Phase 3: Training Infrastructure (Month 3-4)

# Training loop with FSDP + gradient accumulation

import functools
import math

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def setup_training(config, model, train_dataset):
    # FSDP wrapping
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock}
    )
    model = FSDP(model, auto_wrap_policy=auto_wrap_policy,
                 mixed_precision=MixedPrecision(
                     param_dtype=torch.bfloat16,
                     reduce_dtype=torch.bfloat16,
                     buffer_dtype=torch.bfloat16,
                 ))
    
    # Optimizer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),
        eps=1e-8,
        weight_decay=0.1
    )
    
    # Scheduler: warmup + cosine decay (warmup_steps / total_steps assumed fields on config)
    warmup_steps, total_steps = config.warmup_steps, config.total_steps
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max(0.1, 0.5 * (1 + math.cos(math.pi * progress)))
    
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    
    return model, optimizer, scheduler

# Training step (scheduler and global step are passed in; the optimizer only
# steps every grad_accum_steps micro-batches)
def train_step(model, batch, optimizer, scheduler, step, grad_accum_steps):
    tokens, targets = batch
    
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        logits, loss = model(tokens, targets)
        loss = loss / grad_accum_steps
    
    loss.backward()
    
    if step % grad_accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    
    return loss.item() * grad_accum_steps

14.5 Phase 4: Evaluation & Iteration (Ongoing)

Evaluation Checkpoints:
  Every 1000 steps:
    - Validation loss (held-out data)
    - Perplexity on test set
  
  Every 5000 steps:
    - Run lm-evaluation-harness on core benchmarks
    - MMLU, HellaSwag, ARC, TruthfulQA
  
  After pre-training completes:
    - Full benchmark suite
    - Human evaluation samples
    - Red-teaming for safety issues

14.6 Phase 5: Post-Training (Month 5-6)

1. Collect SFT data:
   - Buy/license instruction datasets
   - Use GPT-4 to generate synthetic data
   - Human annotators for quality examples
   
2. Fine-tune with TRL/Axolotl:
   accelerate launch train_sft.py \
     --model_name_or_path base_model/ \
     --dataset_name sft_data \
     --max_seq_length 4096 \
     --num_train_epochs 3 \
     --per_device_train_batch_size 4 \
     --gradient_accumulation_steps 4

3. Collect preference data:
   - Sample 2+ outputs for each prompt
   - Human annotators rank outputs
   - Tools: LabelStudio, Argilla, Scale AI

4. Train reward model:
   python train_reward_model.py \
     --model sft_model/ \
     --data preference_data.json
   
5. RLHF/DPO:
   python train_dpo.py \
     --model sft_model/ \
     --reward_model rm_model/ \
     --beta 0.1

15. Reverse Engineering Existing LLMs

15.1 Approach & Methodology

Reverse engineering modern LLMs means studying their papers, open implementations, and behavioral analysis to understand design decisions.

15.2 Reverse Engineering GPT-4 (What We Know)

  • Architecture (from papers + leaks):
    • Mixture of Experts: ~8 experts, 2 active per token
    • Estimated 1.8T total params, ~200B active per forward pass
    • ~120 transformer layers
    • Context: 128K tokens (GPT-4 Turbo)
  • Training Data: ~13T tokens estimated
  • RLHF: Extensive human feedback + InstructGPT methodology
  • Safety: Constitutional AI-like red-teaming
  • Multimodal: CLIP-style vision encoder + projection

15.3 Reverse Engineering Llama 3.1 (Open Weights)

# Inspect Llama 3.1 70B architecture
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
print(config)

# Key findings:
# hidden_size: 8192
# intermediate_size: 28672
# num_attention_heads: 64
# num_key_value_heads: 8   ← GQA (8 KV heads vs 64 Q heads)
# num_hidden_layers: 80
# rope_theta: 500000.0
# vocab_size: 128256
# max_position_embeddings: 131072
# rms_norm_eps: 1e-05
# hidden_act: "silu"

15.4 Behavioral Reverse Engineering

Techniques:
1. Prompt probing - test specific capabilities systematically
2. Activation patching - identify which layers encode which info
   (requires white-box access or a similar open model)
3. Mechanistic interpretability:
   - Identify attention head functions (induction heads, copy heads)
   - Superposition hypothesis: polysemantic neurons
   - Sparse autoencoders to find features (Anthropic's SAE work)
4. Logit lens - project intermediate representations to the vocabulary
5. Activation analysis - t-SNE/UMAP of hidden states
6. Probing classifiers - train linear probes on hidden states

15.5 Studying Open Source LLMs

Key open models to study (in order of insight value):

1. GPT-2 (117M) - OpenAI, fully open, educational
   git clone https://github.com/openai/gpt-2

2. LLaMA 3 (8B-405B) - Meta, open weights + tokenizer details
   Excellent reference architecture

3. Mistral 7B - reference for sliding window + GQA

4. Falcon (1B-180B) - Technology Innovation Institute
   Original GQA + MQA reference

5. Pythia (70M-12B) - EleutherAI, training checkpoints available
   Study training dynamics over time

6. OLMo (7B) - Allen AI, truly open (code + data + checkpoints)
   Best for training process study

7. MosaicML MPT - HuggingFace-native architecture

Study approach:
- Read architecture paper
- Clone training codebase
- Trace forward pass manually
- Measure parameter counts per component
- Profile memory and compute requirements

16. Building Your Own LLM Service

16.1 Service Architecture Overview

                     ┌──────────────────────┐
                     │    Load Balancer     │
                     │   (nginx/Traefik)    │
                     └──────────┬───────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
 ┌────────▼───────┐    ┌────────▼───────┐    ┌────────▼───────┐
 │   API Server   │    │   API Server   │    │   API Server   │
 │   (FastAPI)    │    │   (FastAPI)    │    │   (FastAPI)    │
 └────────┬───────┘    └────────┬───────┘    └────────┬───────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                      ┌─────────▼─────────┐
                      │   Request Router  │
                      │  (Priority Queue) │
                      └─────────┬─────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
 ┌────────▼───────┐    ┌────────▼───────┐    ┌────────▼───────┐
 │ Inference Node │    │ Inference Node │    │ Inference Node │
 │   vLLM / TGI   │    │   vLLM / TGI   │    │   vLLM / TGI   │
 │    (4×H100)    │    │    (4×H100)    │    │    (4×H100)    │
 └────────┬───────┘    └────────────────┘    └────────────────┘
          │
 ┌────────▼──────────────────────────────┐
 │         Supporting Services           │
 │  Redis (cache) | PostgreSQL (logs)    │
 │  Prometheus (metrics) | Grafana (viz) │
 │  MinIO (model artifacts)              │
 └───────────────────────────────────────┘

16.2 API Layer Implementation

# FastAPI server for LLM service
import json
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

app = FastAPI(title="LLM API Service")

# Initialize vLLM engine
engine_args = AsyncEngineArgs(
    model="your_model_path",
    tensor_parallel_size=4,    # 4 GPUs
    gpu_memory_utilization=0.95,
    max_num_batched_tokens=32768,
    max_num_seqs=256,
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

class ChatRequest(BaseModel):
    messages: list[dict]
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = False

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    # Apply chat template (apply_chat_template is an app-level helper, not shown here)
    prompt = apply_chat_template(request.messages)
    
    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    
    if request.stream:
        return StreamingResponse(
            stream_generator(prompt, sampling_params),
            media_type="text/event-stream"
        )
    
    # Non-streaming: AsyncLLMEngine.generate yields partial results; keep the last one
    final_output = None
    async for result in engine.generate(prompt, sampling_params, request_id=str(uuid4())):
        final_output = result

    return format_openai_response(final_output)  # helper not shown: builds the OpenAI-style JSON

async def stream_generator(prompt, sampling_params):
    async for output in engine.generate(prompt, sampling_params, str(uuid4())):
        chunk = format_stream_chunk(output)
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

16.3 Deployment with Kubernetes

# k8s deployment for LLM inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - /models/llama-3-70b
          - --tensor-parallel-size
          - "4"
          - --max-num-batched-tokens
          - "32768"
          - --port
          - "8000"
        resources:
          limits:
            nvidia.com/gpu: 4
          requests:
            memory: "200Gi"
            cpu: "32"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      nodeSelector:
        nvidia.com/gpu.product: "H100-SXM-80GB"

16.4 Monitoring & Observability

# Prometheus metrics for LLM service
from prometheus_client import Counter, Histogram, Gauge

REQUEST_COUNT = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
REQUEST_LATENCY = Histogram('llm_request_latency_seconds', 
                             'Request latency', ['model'],
                             buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
TOKENS_GENERATED = Counter('llm_tokens_generated_total', 'Tokens generated', ['model'])
GPU_MEMORY_USED = Gauge('llm_gpu_memory_bytes', 'GPU memory used', ['gpu_id'])
QUEUE_SIZE = Gauge('llm_queue_size', 'Current queue depth')
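
To expose and update these metrics, one option is an ASGI /metrics endpoint plus a request middleware; a minimal sketch, assuming the FastAPI app from 16.2 (the label value and wiring are illustrative, and TOKENS_GENERATED / QUEUE_SIZE would be updated inside the generation path itself):

# Wiring the metrics into the FastAPI app (sketch)
import time
from prometheus_client import make_asgi_app

app.mount("/metrics", make_asgi_app())        # Prometheus scrapes this endpoint

MODEL_NAME = "llama-3-70b"                    # hypothetical label value

@app.middleware("http")
async def record_request_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    status = "ok" if response.status_code < 400 else "error"
    REQUEST_COUNT.labels(model=MODEL_NAME, status=status).inc()
    REQUEST_LATENCY.labels(model=MODEL_NAME).observe(time.perf_counter() - start)
    return response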

16.5 Cost Estimation

Infrastructure Cost Example (7B model, 100K daily users):
  
Serving: 2× 8×A100 nodes (AWS p4d.24xlarge)
  Cost: ~$32/hr/node × 2 = $64/hr = $1,536/day = $46K/month

Storage: 100TB (model, logs, cache)
  Cost: ~$2,300/month (S3)

Network: ~10TB outbound/month
  Cost: ~$900/month (at ~$0.09/GB egress)

Training (one-time, 7B model, Llama-style over-training on ~2T tokens
  rather than the Chinchilla-optimal ~140B):
  FLOPs ≈ 6 × 7e9 params × 2e12 tokens ≈ 8.4e22
  At 312 TFLOPS (A100 BF16) × ~40% MFU ≈ 125 TFLOPS effective per GPU
  → ~190K GPU-hours; budget ~200-300K with checkpointing and restarts
  Cost: ~$300K one-time (at ~$1-1.5/A100-hour) for a quality 7B model

Break-even: ~$0.001/1K tokens at scale
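
The training figures above follow the standard "FLOPs ≈ 6 × params × tokens" rule of thumb; a small calculator makes the assumptions (MFU, $/GPU-hour) explicit:

# Back-of-envelope pre-training cost via the ~6 * N * D FLOPs rule
def pretraining_cost(params, tokens, peak_tflops=312, mfu=0.4, dollars_per_gpu_hour=1.5):
    total_flops = 6 * params * tokens
    effective_flops = peak_tflops * 1e12 * mfu           # sustained FLOPs/s per GPU
    gpu_hours = total_flops / effective_flops / 3600
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

# 7B params on 2T tokens, A100 BF16 peak 312 TFLOPS, ~40% MFU, ~$1.5/GPU-hour
gpu_hours, cost = pretraining_cost(7e9, 2e12)
print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")      # ~187,000 GPU-hours, ~$280,000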

17. Cutting-Edge Developments

17.1 Test-Time Compute Scaling (2024-2025)

  • OpenAI o1/o3 – extended chain-of-thought reasoning
    • Models "think" for seconds to minutes before answering
    • Process Reward Models (PRMs) guide reasoning
    • MCTS/beam search over reasoning steps
    • Breakthrough on AIME math, competition programming
  • DeepSeek-R1 – open-source reasoning model
    • GRPO training (Group Relative Policy Optimization)
    • RL applied directly to reasoning with rule-based rewards, no PRM labeling
    • Matches o1 on many benchmarks at lower cost
  • Test-time compute scaling law: more inference compute → better results (see the best-of-N sketch below)
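
A toy illustration of the simplest form of test-time compute scaling, best-of-N sampling against a verifier score; sample_answer and score_answer are stand-in stubs, not any particular model or PRM:

# Best-of-N sampling at inference time (toy sketch with stub model and scorer)
import random

def sample_answer(question: str) -> str:
    # Stand-in for sampling one chain-of-thought + answer from an LLM
    return f"candidate {random.randint(0, 9)} for: {question}"

def score_answer(question: str, answer: str) -> float:
    # Stand-in for a process/outcome reward model or rule-based verifier
    return random.random()

def best_of_n(question: str, n: int = 16) -> str:
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: score_answer(question, a))

print(best_of_n("What is 17 * 24?", n=8))   # spending more samples (larger n) buys accuracy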

17.2 Multimodal LLMs

  • Architecture: Vision encoder → projector → LLM
    • CLIP/SigLIP → Linear/MLP projector → Decoder-only LLM (see the projector sketch after this list)
  • GPT-4V/GPT-4o: images, audio, text unified
  • Gemini 1.5 Pro: 1M context, native multimodal
  • LLaVA / LLaVA-NeXT: open multimodal models
  • Qwen-VL: image/video understanding
  • Video LLMs: VideoLLaMA, Video-LLaVA, Qwen2-VL
  • Any-to-Any: Unified IO, CoDi, NExT-GPT
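
A minimal PyTorch sketch of the "vision encoder → projector → LLM" pattern above; the dimensions are illustrative (e.g., ViT-L patch features of width 1024 projected into a 4096-dim LLM embedding space), and the two-layer MLP mirrors a LLaVA-1.5-style projector:

# Vision-to-LLM projector sketch: image patch features -> LLM-embedding-sized "visual tokens"
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from a frozen CLIP/SigLIP encoder
        return self.proj(patch_features)

projector = VisionProjector()
fake_patches = torch.randn(2, 576, 1024)      # e.g., 24x24 patches from ViT-L/14 at 336px
visual_tokens = projector(fake_patches)
print(visual_tokens.shape)                    # torch.Size([2, 576, 4096])
# These visual tokens are concatenated with text token embeddings and fed to the decoder-only LLM.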

17.3 Efficient Architecture Innovations

  • GQA (2023) – grouped query attention, now standard
  • Sliding Window + Full Attention Hybrid – Mistral approach
  • MLA (Multi-head Latent Attention) – DeepSeek-V2/V3
    • Low-rank KV compression: 93% KV cache reduction
    • Matches MHA quality with MQA efficiency
  • Differential Attention – Microsoft 2024
    • Cancels noise in attention via the difference of two softmax maps
  • Linear Attention / RetNet / RWKV / Mamba
    • Subquadratic alternatives to standard attention
  • TTT (Test-Time Training) – context as gradient descent

17.4 Training Innovations

  • Flash Attention 3 – hardware-aware kernels for H100, with FP8 support
  • FP8 Training – native 8-bit training on H100
  • Online RLHF – continuously update the RM with new data
  • RLAIF – AI feedback replacing human annotation
  • Constitutional AI 2.0 – multi-principle alignment
  • Direct Preference Optimization variants (IPO, KTO, ORPO)
  • Synthetic Data Generation – Phi series, Llama distillation
  • Curriculum Learning – easy→hard data ordering
  • Data Attribution – identify the most influential training examples

17.5 Inference & Serving Innovations

  • Speculative Decoding – 2-4× speedup with no quality loss (toy sketch after this list)
  • Medusa / EAGLE – parallel decoding heads
  • Continuous Batching – vLLM's signature feature
  • Chunked Prefill – interleave prefill and decode
  • Prefix Caching – reuse KV cache across requests
  • Quantization advances: GPTQ, AWQ, AQLM, FP8 inference
  • MoE routing optimization – expert parallelism
  • Disaggregated prefill/decode – separate servers for each phase
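
A toy sketch of the greedy variant of speculative decoding mentioned above: a cheap draft model proposes k tokens, the target model verifies them and keeps the longest agreeing prefix plus one corrected token. The two "models" here are arithmetic stand-ins, and real implementations verify all draft positions in a single batched forward pass:

# Greedy speculative decoding loop (toy sketch; draft_next_token / target_prediction are stubs)
def draft_next_token(tokens):
    return (sum(tokens) * 31 + 7) % 1000

def target_prediction(tokens):
    return (sum(tokens) * 131 + 13) % 1000

def speculative_decode(tokens, k=4, steps=8):
    tokens = list(tokens)
    for _ in range(steps):
        # 1) Draft k tokens autoregressively with the cheap model
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept draft tokens while they match the target model's own predictions
        accepted = 0
        for i in range(k):
            t = target_prediction(tokens + draft[:i])
            if t == draft[i]:
                accepted += 1
            else:
                tokens += draft[:accepted] + [t]     # agreeing prefix + target's correction
                break
        else:
            tokens += draft + [target_prediction(tokens + draft)]   # all accepted + bonus token
    return tokens

print(speculative_decode([1, 2, 3]))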

17.6 Long Context & Memory

  • Retrieval Augmented Generation 2.0
    • Self-RAG, FLARE, Adaptive RAG
    • Multi-hop reasoning over retrieved docs
  • Infinite context: StreamingLLM, MemGPT, Infini-Attention
  • Memory networks: Titans (2025), neural long-term memory
  • Long-context frontier: Gemini 1.5 (1M), Claude 3 (200K), Llama 3.1 (128K)
  • Persistent memory systems: vector databases + LLM

17.7 Agentic AI (2024-2025)

  • Tool use / Function calling – structured JSON outputs (dispatch-loop sketch after this list)
  • Code execution – Python interpreter as a tool
  • Browser agents – web navigation (Computer Use, WebAgent)
  • Multi-agent systems – AutoGen, CrewAI, LangGraph
  • Long-horizon planning – hierarchical task decomposition
  • World models – model-based reasoning about the environment
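
The dispatch loop behind tool use is small; a sketch with a stubbed model (call_model returns a canned JSON action rather than querying a real LLM, and get_weather is a hypothetical tool):

# Tool-use / function-calling dispatch loop (sketch with stubs)
import json

def get_weather(city: str) -> str:
    return f"22 C and sunny in {city}"                 # stand-in tool

TOOLS = {"get_weather": get_weather}

def call_model(messages):
    # Stand-in for an LLM that emits either a JSON tool call or a final answer
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "arguments": {"city": "Paris"}}
    return {"final": "It is 22 C and sunny in Paris right now."}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(5):                                 # cap the number of tool-use rounds
        action = call_model(messages)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["arguments"])
        messages.append({"role": "tool",
                         "content": json.dumps({"name": action["tool"], "result": result})})
    return "Stopped: too many tool calls."

print(run_agent("What's the weather in Paris?"))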

18. Build Ideas – Beginner to Advanced

🟢 Beginner Level (Months 1-6)

Project 1: GPT from Scratch (The Classic) Beginner

Goal: Build and train a character-level GPT
Skills: PyTorch basics, attention, training loop
Dataset: tiny_shakespeare.txt (~1MB)
Model: ~10K-100K parameters
Reference: Andrej Karpathy's "nanoGPT" tutorial

Project 2: Train a Tiny Tokenizer Beginner

Goal: Implement BPE tokenizer from scratch
Skills: String processing, Python
Dataset: Text corpus of your choice
Deliverable: Custom tokenizer matching tiktoken output
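
The core of the BPE training loop fits in a page: repeatedly merge the most frequent adjacent symbol pair until the merge budget is spent. A compressed educational sketch (a tiktoken-compatible tokenizer additionally needs byte-level pre-tokenization, regex splitting, and special tokens):

# Minimal byte-pair-encoding training loop (educational sketch, not tiktoken-compatible)
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    # Represent each word as a tuple of symbols plus an end-of-word marker
    words = Counter(tuple(w) + ("</w>",) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest newer newest wider", num_merges=10))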

Project 3: BERT Fine-Tuning for Classification Beginner

Goal: Fine-tune BERT for sentiment analysis
Skills: HuggingFace Transformers, fine-tuning
Dataset: SST-2, IMDB, or custom
Deliverable: 90%+ accuracy classifier with API

Project 4: Chatbot with LoRA Fine-Tuning Beginner

Goal: Fine-tune Llama 3.1 8B on custom instructions
Skills: PEFT, QLoRA, Axolotl
Dataset: 1K-10K instruction pairs
Hardware: 1× RTX 4090 or Colab A100
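
The fine-tuning setup for this project boils down to a handful of PEFT calls; a minimal QLoRA-style sketch (the model name, target modules, and hyperparameters below are illustrative defaults, not a tuned recipe):

# Minimal LoRA/QLoRA setup with HuggingFace PEFT (illustrative hyperparameters)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"     # assumed base model

bnb_config = BitsAndBytesConfig(                          # 4-bit base weights (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # typically well under 1% of parameters are trainable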

Project 5: RAG System Beginner

Goal: Build retrieval-augmented Q&A over documents
Skills: Embeddings, FAISS, LangChain
Components: PDF loader → chunker → embedder → retriever → LLM
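
The component chain maps almost one-to-one onto code; a minimal retrieval sketch with sentence-transformers + FAISS (the embedding model and toy "chunks" are assumptions, and the final LLM call is left out):

# Minimal RAG retrieval sketch: chunks -> embeddings -> FAISS index -> retrieve -> prompt
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "The KV cache stores keys and values for previously generated tokens.",
    "LoRA adds low-rank adapter matrices to frozen weight matrices.",
    "RoPE encodes positions by rotating query and key vectors.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
embeddings = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])            # inner product = cosine (normalized)
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 2):
    q = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

query = "How does LoRA work?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)                                             # this prompt would be sent to the LLM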

🟡 Intermediate Level (Months 6-18)

Project 6: Train a 125M Parameter LLM Intermediate

Goal: Pre-train GPT-2 sized model on domain data
Skills: Distributed training, data pipeline, evaluation
Dataset: 10-50B tokens (domain-specific)
Hardware: 4-8× A100 GPUs
Framework: Megatron-LM or custom PyTorch FSDP
Cost: ~$5K-20K compute

Project 7: Reward Model Training Intermediate

Goal: Train a reward model for RLHF
Skills: Preference data collection, Bradley-Terry model
Dataset: 50K+ comparison pairs
Deliverable: RM that scores responses 0-10
Evaluation: Accuracy on held-out comparisons
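
The objective behind this project is the Bradley-Terry pairwise loss: push the reward of the chosen response above the rejected one. A sketch of the loss with scalar rewards (a real RM produces them from a transformer backbone with a value head on the final token):

# Bradley-Terry pairwise loss for reward model training (sketch)
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch of preference pairs
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards for 4 preference pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5], requires_grad=True)
r_rejected = torch.tensor([0.1, 0.5, 1.0, -1.5])
loss = bradley_terry_loss(r_chosen, r_rejected)
loss.backward()
print(round(loss.item(), 4))
# Held-out accuracy = fraction of pairs where r_chosen > r_rejected.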

Project 8: Multimodal LLM (Vision + Text) Intermediate

Goal: Build LLaVA-style model
Architecture: CLIP ViT-L + projection MLP + Llama 3B
Training: 2-stage (align → instruction-tune)
Dataset: LLaVA-CC3M-Pretrain-595K + LLaVA-Instruct-150K
Skills: Multimodal data, vision encoder integration

Project 9: Production Inference Service Intermediate

Goal: Deploy your fine-tuned model as a production API
Components:
  - vLLM/TGI inference engine
  - FastAPI with streaming support
  - Redis for rate limiting + caching
  - Prometheus + Grafana monitoring
  - Docker Compose → Kubernetes migration
SLA: 99.9% uptime, <500ms p50 latency

Project 10: Code Generation Model Intermediate

Goal: Fine-tune or train a code-specialized LLM
Dataset: The Stack (languages you support)
Eval: HumanEval, MBPP, SWE-Bench
Features: FIM (fill-in-middle), multi-file context

🔴 Advanced Level (Months 18-36+)

Project 11: 7B Parameter Pre-training from Scratch Advanced

Goal: Train a competitive open-source 7B model
Budget: $200K-500K compute (reducible with data and efficiency optimizations)
Data: 1-2T tokens of curated web + books + code
Architecture: Llama 3-style (GQA, RoPE, SwiGLU, RMSNorm)
Training: 3D parallelism on 64-128× H100s
Evaluation: Competitive with Llama 3 8B on MMLU, HellaSwag

Project 12: Full RLHF Pipeline Advanced

Goal: Complete SFT → RM → PPO pipeline
SFT: 500K high-quality instruction examples
RM: 100K preference comparisons, 75%+ agreement accuracy
PPO: Stable training, no mode collapse
Deliverable: RLHF-tuned model preferred over SFT by humans
Tools: OpenRLHF or custom PPO implementation

Project 13: Reasoning Model (o1-style) Advanced

Goal: Build a reasoning model with extended CoT
Approach 1: MCTS + PRM training
Approach 2: GRPO like DeepSeek-R1
Dataset: Math (MATH, AMC, AIME) + code problems
Metric: AIME accuracy, competition math benchmarks
Novel contribution: Improved search algorithm or reward shaping

Project 14: MoE Language Model Advanced

Goal: Build Mixtral-style MoE model
Architecture: 8 experts, top-2 routing, 7B active params
Challenge: Load balancing, expert collapse prevention
Benefit: ~47B total params, only ~12.9B active params per token
Framework: Megablocks or custom CUDA kernel

Project 15: LLM Research Contribution Advanced

Goal: Novel research contribution publishable at ACL/NeurIPS/ICLR
Ideas:
  - New attention mechanism for long context
  - Better data selection algorithm
  - Novel PEFT method
  - Interpretability finding
  - New benchmark or evaluation methodology
  - Alignment technique
  - Efficient architecture variant
Process: Baseline → ablation → comparison → writeup → submission

19. Research Papers You Must Read

Foundational

  1. Attention Is All You Need (Vaswani et al., 2017) – The Transformer
  2. BERT (Devlin et al., 2018) – Bidirectional pre-training
  3. GPT-2 (Radford et al., 2019) – Language model pre-training
  4. GPT-3 (Brown et al., 2020) – Few-shot learners, scaling
  5. Scaling Laws for Neural LMs (Kaplan et al., 2020)

Architecture

  1. RoFormer/RoPE (Su et al., 2021) – Rotary position embedding
  2. ALiBi (Press et al., 2021) – Attention with linear biases
  3. GQA (Ainslie et al., 2023) – Grouped query attention
  4. FlashAttention (Dao et al., 2022) – IO-aware attention
  5. FlashAttention-2 (Dao, 2023)
  6. Mistral 7B (Jiang et al., 2023) – SWA + GQA
  7. Mixtral (Jiang et al., 2024) – Sparse MoE
  8. Mamba (Gu & Dao, 2023) – Linear-time sequence modeling
  9. LLaMA (Touvron et al., 2023) and LLaMA 2 & 3

Training & Optimization

  1. Chinchilla (Hoffmann et al., 2022) – Scaling laws revised
  2. PaLM (Chowdhery et al., 2022) – Large-scale language modeling
  3. Megatron-LM (Shoeybi et al., 2019) – Efficient large model training
  4. ZeRO (Rajbhandari et al., 2020) – Memory optimization
  5. AdamW (Loshchilov & Hutter, 2017) – Decoupled weight decay
  6. Lion Optimizer (Chen et al., 2023)

Alignment & RLHF

  1. InstructGPT (Ouyang et al., 2022) – RLHF for instruction following
  2. Constitutional AI (Bai et al., 2022) – Anthropic's alignment
  3. DPO (Rafailov et al., 2023) – Direct preference optimization
  4. RLHF (Christiano et al., 2017) – Original RLHF paper
  5. Self-Play Fine-Tuning (SPIN) (Chen et al., 2024)

Inference

  1. Speculative Decoding (Leviathan et al., 2022)
  2. vLLM / PagedAttention (Kwon et al., 2023)
  3. GPTQ (Frantar et al., 2022) – Post-training quantization
  4. AWQ (Lin et al., 2023) – Activation-aware quantization
  5. QLoRA (Dettmers et al., 2023) – Efficient fine-tuning

Reasoning & Capabilities

  1. Chain-of-Thought Prompting (Wei et al., 2022)
  2. Self-Consistency (Wang et al., 2022)
  3. Tree of Thoughts (Yao et al., 2023)
  4. ReAct (Yao et al., 2022) – Reasoning + acting
  5. DeepSeek-R1 (DeepSeek, 2025) – Open reasoning model

Recent (2024-2025)

  1. DeepSeek-V3 (2024) – Efficient large MoE
  2. Gemini 1.5 (2024) – 1M context
  3. Claude 3 Technical Report – Constitutional AI advances
  4. Llama 3 (Meta, 2024) – Technical report
  5. Titans (2025) – Neural long-term memory

20. Complete Learning Timeline

Phase 1: Foundations (Months 1-3)

Month 1: Math & Programming
  Week 1-2: Linear algebra (3Blue1Brown + Gilbert Strang MIT)
  Week 3-4: Calculus, probability, statistics (Khan Academy + Bishop PRML)

Month 2: ML & DL Basics
  Week 1-2: Classical ML (Andrew Ng Coursera)
  Week 3-4: Deep learning (fast.ai Part 1, or d2l.ai)

Month 3: NLP & Transformers
  Week 1-2: NLP fundamentals, word vectors
  Week 3-4: Transformer from scratch + HuggingFace ecosystem
  Project: Train character-level GPT on Shakespeare

Phase 2: LLM Fundamentals (Months 4-6)

Month 4: Transformer Internals
  - Read: "Attention Is All You Need", GPT-2 paper, BERT paper
  - Implement: Multi-head attention, RoPE, RMSNorm from scratch
  - Project: Fine-tune BERT on custom classification task

Month 5: Training at Scale
  - Study: Megatron-LM, DeepSpeed ZeRO, FSDP
  - Implement: Distributed training with FSDP on 2-4 GPUs
  - Project: Train 125M GPT on ~1B token dataset

Month 6: Fine-Tuning & Alignment
  - Study: LoRA, QLoRA, SFT, DPO papers
  - Implement: LoRA adapter, QLoRA training pipeline
  - Project: Fine-tune Llama 3 8B on instruction dataset with QLoRA

Phase 3: Intermediate Skills (Months 7-12)

Month 7-8: Data Pipeline Engineering
  - Web scraping at scale, datatrove
  - Deduplication with MinHash
  - Quality filtering pipeline
  - Project: Build 10B token domain corpus

Month 9-10: Production Serving
  - vLLM deployment, FastAPI, Docker, K8s
  - Monitoring, autoscaling, caching
  - Project: Deploy fine-tuned 7B model as production API

Month 11-12: Evaluation & Benchmarking
  - Run lm-evaluation-harness
  - Build custom eval suite
  - Understand benchmarks: MMLU, HumanEval, MT-Bench
  - Project: Comprehensive eval of your model vs. baselines

Phase 4: Advanced Training (Months 13-24)

Month 13-15: Pre-training from Scratch
  - Architect and implement 7B parameter model
  - Data pipeline: 500B-1T tokens
  - 3D parallel training on H100 cluster
  - Training stability, loss monitoring, recovery

Month 16-18: Full RLHF Pipeline
  - Preference data collection tools
  - Reward model training and evaluation
  - PPO or DPO training
  - Safety evaluation + red-teaming

Month 19-21: Advanced Topics
  - Mixture of Experts
  - Multimodal extensions (vision + language)
  - Long context techniques
  - Speculative decoding

Month 22-24: Research Contribution
  - Novel technique or finding
  - Paper writing + submission
  - Open-source contribution

Phase 5: Mastery (Month 24+)

- Lead model development at company or in open source
- Publish research papers
- Build novel architectures
- Start your own AI company or project
- Contribute to frontier model development

📚 Essential Resources

Books

Online Courses

Blogs & Communities

GitHub Repositories to Study