🧠 Complete LLM Development Roadmap

Building Your Own Large Language Model & AI Service (Like Claude, Gemini, ChatGPT)

Scope: This roadmap covers everything from foundational math to deploying a production-grade LLM service: structured learning paths, algorithms, architecture, hardware, reverse engineering, and cutting-edge developments.
Last Updated: 2025  |  Covers models through Llama 3.1, DeepSeek-V3, DeepSeek-R1, Gemini 1.5, GPT-4o, Claude 3.5  |  Total Estimated Effort: 1500–3000 hours of focused study and implementation

1. Foundation Prerequisites

1.1 Programming Languages

  • Python (Primary Language)
    • OOP, functional programming, decorators, generators
    • Async/await, multiprocessing, threading
    • Memory management, profiling, optimization
    • Type hints, dataclasses, abstract classes
  • C/C++ (Performance-critical components)
    • Pointers, memory allocation, RAII
    • CUDA extensions, custom kernels
  • CUDA (GPU Programming)
    • Thread blocks, warps, shared memory
    • Memory coalescing, kernel optimization
  • Bash/Shell (DevOps, automation)
  • SQL (Data management)
  • Rust (Optional - emerging for inference engines)

1.2 Computer Science Fundamentals

  • Data Structures: Arrays, Trees, Graphs, Hash Tables, Heaps
  • Algorithms: Sorting, Searching, Dynamic Programming, Graph algorithms
  • Complexity Analysis: Big-O notation, space/time tradeoffs
  • Distributed Systems: CAP theorem, consensus algorithms, sharding
  • Operating Systems: Process management, memory paging, I/O
  • Computer Networks: TCP/IP, HTTP/2, gRPC, WebSockets
  • Databases: Relational (PostgreSQL), NoSQL (MongoDB, Redis), Vector DBs

1.3 Software Engineering Practices

  • Version Control: Git, GitHub, branching strategies
  • Testing: Unit, integration, regression, load testing
  • CI/CD: GitHub Actions, Jenkins, Docker, Kubernetes
  • Design Patterns: Factory, Observer, Strategy, Pipeline
  • API Design: REST, GraphQL, gRPC
  • Containerization: Docker Compose, Kubernetes orchestration

2. Mathematics & Statistics Deep Dive

2.1 Linear Algebra (Most Critical)

  • Vectors & Spaces
    • Vector operations, dot products, cross products
    • Vector spaces, basis, span, linear independence
    • Subspaces, null space, column space
  • Matrices
    • Matrix multiplication, transpose, inverse
    • Rank, determinant, trace
    • Special matrices: diagonal, orthogonal, symmetric, positive definite
  • Eigendecomposition
    • Eigenvalues, eigenvectors, characteristic polynomial
    • Diagonalization, spectral theorem
    • Power iteration, QR algorithm
  • Singular Value Decomposition (SVD)
    • Full vs. truncated SVD
    • Applications in dimensionality reduction, LoRA
    • Relationship to PCA
  • Tensor Operations
    • Higher-order tensors, tensor contractions
    • Einstein summation notation (einsum)
    • Tensor decomposition (Tucker, CP)
  • Norms & Distances
    • L1, L2, Frobenius, nuclear norms
    • Cosine similarity, KL divergence as distance

2.2 Calculus & Optimization

  • Differential Calculus
    • Derivatives, partial derivatives, directional derivatives
    • Chain rule, product rule, quotient rule
    • Jacobian matrix, Hessian matrix
    • Taylor series expansion
  • Integral Calculus
    • Definite/indefinite integrals
    • Fundamental theorem of calculus
    • Numerical integration (quadrature)
  • Multivariable Calculus
    • Gradient, divergence, curl
    • Lagrange multipliers, constrained optimization
    • Vector fields and flow
  • Optimization Theory
    • Convex vs. non-convex optimization
    • First and second-order optimality conditions
    • Saddle points, local vs. global minima
    • Lagrangian relaxation, KKT conditions

2.3 Probability & Statistics

  • Probability Theory
    • Probability spaces, sample spaces, events
    • Conditional probability, Bayes' theorem
    • Law of large numbers, central limit theorem
    • Moment generating functions
  • Probability Distributions
    • Discrete: Bernoulli, Binomial, Poisson, Categorical
    • Continuous: Gaussian, Uniform, Beta, Dirichlet, Laplace
    • Multivariate distributions, covariance matrices
  • Information Theory
    • Entropy, cross-entropy, joint entropy
    • Kullback-Leibler (KL) divergence
    • Mutual information, Jensen-Shannon divergence
    • Minimum description length
  • Statistical Estimation
    • Maximum likelihood estimation (MLE)
    • Maximum a posteriori (MAP)
    • Bayesian inference, prior/posterior
    • Expectation-Maximization (EM) algorithm
  • Sampling Methods
    • Monte Carlo sampling
    • Markov Chain Monte Carlo (MCMC)
    • Importance sampling
    • Temperature sampling, top-k, top-p (nucleus sampling)

2.4 Numerical Methods

  • Floating point arithmetic, precision issues (fp16, bf16, fp32)
  • Numerical stability, gradient clipping
  • Fast Fourier Transform (FFT)
  • Sparse matrix operations
  • Iterative solvers (conjugate gradient)

3. Machine Learning Fundamentals

3.1 Core Concepts

  • Supervised, Unsupervised, Semi-supervised, Self-supervised learning
  • Bias-variance tradeoff, overfitting, underfitting
  • Regularization: L1/L2, dropout, weight decay, early stopping
  • Cross-validation, hyperparameter tuning
  • Feature engineering, normalization, standardization

3.2 Classical Algorithms

  • Linear Regression, Logistic Regression
  • Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM)
  • Support Vector Machines (SVM), kernel trick
  • K-Nearest Neighbors (KNN)
  • Naive Bayes, Gaussian Mixture Models
  • PCA, t-SNE, UMAP (dimensionality reduction)
  • K-Means, DBSCAN, Hierarchical clustering

3.3 Gradient Descent & Optimizers

  • Vanilla Gradient Descent - full-batch, slow but stable
  • Stochastic Gradient Descent (SGD) - noisy but generalizes
  • Mini-batch SGD - industry standard balance
  • Momentum - exponential moving average of gradients
  • Nesterov Momentum - look-ahead momentum update
  • AdaGrad - per-parameter adaptive learning rate
  • RMSProp - decaying average of squared gradients
  • Adam - combines momentum + RMSProp
    • m_t = β1 * m_{t-1} + (1 - β1) * g_t
    • v_t = β2 * v_{t-1} + (1 - β2) * g_t²
    • θ = θ - α * m̂_t / (√v̂_t + ε)
  • AdamW - Adam with decoupled weight decay (preferred for LLMs; see the sketch after this list)
  • Lion - EvoLved Sign Momentum (Google, 2023)
  • Sophia - second-order optimizer for LLMs
  • LAMB/LARS - large-batch distributed training optimizers
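
A minimal AdamW update sketch for a single scalar parameter (illustrative only; real optimizers vectorize this across whole tensors):

import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    # Exponential moving averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay (the "W" in AdamW): applied to the weights directly
    theta -= lr * weight_decay * theta
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v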

3.4 Loss Functions

  • Mean Squared Error (MSE), Mean Absolute Error (MAE)
  • Cross-Entropy Loss (language modeling: next-token prediction)
  • Binary Cross-Entropy, Categorical Cross-Entropy
  • Contrastive Loss, Triplet Loss (for embeddings)
  • REINFORCE / Policy Gradient Loss (for RLHF)

4. Deep Learning Core

4.1 Neural Network Basics

  • Perceptron, Multi-layer Perceptron (MLP)
  • Activation functions:
    • ReLU: max(0, x) - dead neuron problem
    • GeLU: x * Φ(x) - smooth, used in GPT/BERT
    • SiLU/Swish: x * sigmoid(x) - Llama uses this
    • Mish, ELU, Leaky ReLU
    • Softmax: e^(x_i) / Σ e^(x_j) - for probability distributions
  • Backpropagation algorithm, automatic differentiation
  • Weight initialization: Xavier/Glorot, He initialization, normal/uniform

4.2 Normalization Techniques

  • Batch Normalization - normalizes across the batch dimension
    • μ_B = (1/m) Σ x_i;  σ²_B = (1/m) Σ (x_i - μ_B)²
    • Problems with small batches, sequential data
  • Layer Normalization - normalizes across the feature dimension
    • Used in all modern Transformers
    • LN(x) = (x - μ) / σ * γ + β
  • RMS Normalization (RMSNorm) - simplified LayerNorm
    • RMSNorm(x) = x / RMS(x) * γ; no mean subtraction
    • Used in Llama, Mistral - more efficient
  • Group Normalization - between BatchNorm and LayerNorm
  • Pre-Norm vs. Post-Norm - Pre-Norm (before attention) is more stable for deep networks

4.3 Regularization Deep Dive

  • Dropout - randomly zero out neurons during training
  • DropPath/Stochastic Depth - drop entire residual paths
  • Label Smoothing - soften hard labels to prevent overconfidence
  • Weight Decay (L2) - penalize large weights
  • Gradient Clipping - cap gradient norm to prevent explosion
  • Mixup / CutMix - data augmentation regularizers

4.4 CNN, RNN, LSTM (Pre-Transformer Context)

  • Convolutional Neural Networks - local feature extraction
  • Recurrent Neural Networks - sequential dependencies
    • Vanishing/exploding gradient problem
  • Long Short-Term Memory (LSTM) - gating mechanisms
  • Gated Recurrent Unit (GRU) - simplified LSTM
  • Seq2Seq with attention - foundation of Transformers
  • Encoder-Decoder architecture - original MT framework

5. Natural Language Processing (NLP)

5.1 Text Preprocessing

  • Tokenization strategies:
    • Word-level: simple but large vocabulary
    • Character-level: small vocab but long sequences
    • Subword: best of both worlds
  • Byte-Pair Encoding (BPE) - GPT-2, GPT-4 tokenizer (toy merge sketch after this list)
    • Start with characters, merge the most frequent pairs iteratively
    • Creates a vocabulary of ~32k-100k tokens
  • WordPiece - BERT tokenizer, similar to BPE
  • SentencePiece - language-agnostic; used by Llama, T5
    • Unigram language model variant
  • Tiktoken - OpenAI's fast tokenizer library
  • Stop words, stemming, lemmatization (less used with LLMs)
  • Text normalization: lowercasing, Unicode handling
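
A toy BPE training sketch in pure Python (illustrative only; real tokenizers such as tiktoken or HuggingFace tokenizers operate on bytes and are far faster):

from collections import Counter

def train_bpe(words, num_merges):
    # words: dict mapping a word (as a tuple of symbols) to its corpus frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Example: merges learned from a tiny corpus
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
print(train_bpe(corpus, 3))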

5.2 Word Embeddings (Pre-Transformer)

  • Word2Vec (2013, Google)
    • Skip-gram: predict context from center word
    • CBOW: predict center from context
    • Negative sampling optimization
  • GloVe - Global Vectors, co-occurrence matrix factorization
  • FastText - subword embeddings, handles OOV
  • ELMo - contextual embeddings from a bidirectional LSTM
  • Semantic similarity, analogy tasks (king - man + woman = queen)

5.3 Classic NLP Tasks (Now handled end-to-end by LLMs)

  • Named Entity Recognition (NER)
  • Part-of-Speech (POS) tagging
  • Sentiment Analysis
  • Machine Translation
  • Summarization
  • Question Answering
  • Text Classification

6. Transformer Architecture - The Heart of LLMs

6.1 Original Transformer ("Attention Is All You Need", 2017)

  • Input Embedding - token IDs → dense vectors (dim d_model)
  • Positional Encoding - add position info since there is no recurrence
    • Sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • Learnable positional embeddings (BERT, GPT)
  • Encoder Stack - bidirectional, used for understanding
  • Decoder Stack - autoregressive, used for generation
  • Cross-Attention - decoder attends to encoder outputs

6.2 Attention Mechanism β€” Complete Breakdown

Attention(Q, K, V) = softmax(QK^T / √d_k) * V

Where:
- Q = Query matrix (what we're looking for)
- K = Key matrix (what's available to match)
- V = Value matrix (what we actually retrieve)
- d_k = dimension of keys (scaling factor)
  • Self-Attention - Q, K, V all come from the same input
  • Cross-Attention - Q from decoder, K/V from encoder
  • Causal/Masked Attention - mask future tokens (GPT-style)
    • Lower-triangular mask: M_ij = -∞ if j > i, else 0
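
A minimal single-head causal attention sketch in PyTorch that follows the formula above (production code uses fused kernels such as FlashAttention):

import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (B, T, T)
    T = scores.size(-1)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))          # hide future tokens
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                        # (B, T, d_k)

q = k = v = torch.randn(1, 4, 8)
print(causal_attention(q, k, v).shape)   # torch.Size([1, 4, 8])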

6.3 Multi-Head Attention (MHA)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O

head_i = Attention(Q*W_Q_i, K*W_K_i, V*W_V_i)
  • Each head learns different aspects of relationships
  • Typical: 8, 12, 16, 32, 64 heads
  • Parallel computation, concatenate then project

6.4 Feed-Forward Network (FFN)

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

Or with GeLU:
FFN(x) = GeLU(xW_1 + b_1) * W_2 + b_2

SwiGLU variant (Llama):
FFN(x) = (SiLU(xW_1) * xW_3) * W_2
  • d_ff ≈ 4 * d_model (hidden expansion)
  • SwiGLU variants shrink this to d_ff ≈ (2/3) * 4 * d_model so the third weight matrix keeps the parameter count roughly constant

6.5 Residual Connections & Layer Norm

  • Residual Connection: x = x + Sublayer(LN(x))
  • Pre-norm (before sublayer) - better gradient flow
  • Post-norm (after sublayer) - original paper style
  • Why residuals: prevent vanishing gradients in deep networks

6.6 Positional Encoding Evolution

  • Absolute Positional Encoding - fixed sinusoidal (original)
  • Learnable Absolute PE - BERT, GPT-2
  • Relative Positional Encoding - Transformer-XL
    • Encode distance between tokens, not absolute positions
  • ALiBi (Attention with Linear Biases) - linear penalty
    • Add bias -m * |i-j| to attention scores
    • Better length generalization than sinusoidal
  • RoPE (Rotary Position Embedding) - GPT-NeoX, Llama
    • Rotate Q and K vectors by an angle proportional to position
    • In complex form: RoPE(x, pos) = x * e^(i * θ * pos)
    • Excellent length generalization, used in most SOTA models
    • YaRN/LongRoPE - extend the context of RoPE models
  • NoPE - no positional encoding, rely on attention patterns

6.7 Attention Variants & Optimizations

  • Multi-Query Attention (MQA) - single K, V shared across heads
    • Reduces KV cache size by a factor of num_heads (PaLM, Falcon)
  • Grouped Query Attention (GQA) - groups of heads share K, V
    • Balance between MHA quality and MQA efficiency (Llama 2/3)
  • Sliding Window Attention - each token attends to a local window
    • Mistral uses a 4096-token sliding window
  • Flash Attention - IO-aware exact attention algorithm
    • Tiles Q, K, V to fit in GPU SRAM
    • Never materializes the full N×N attention matrix
    • 2-4x speedup, O(N) memory instead of O(N²)
  • Flash Attention 2 & 3 - further optimizations for H100
  • PagedAttention - vLLM's memory-efficient KV cache paging
  • Ring Attention - distributes attention across devices for ultra-long sequences
  • Sparse Attention - attend to a subset of tokens (Longformer, BigBird)
  • Linear Attention - approximate attention in O(N) time

7. Large Language Model Internals

7.1 Model Families & Architectures

  • Encoder-Only (BERT family)
    • Bidirectional context, MLM pre-training
    • Best for: classification, NER, embeddings
    • Examples: BERT, RoBERTa, DeBERTa, ELECTRA
  • Decoder-Only (GPT family) ← Most modern LLMs
    • Causal/autoregressive, CLM pre-training
    • Best for: generation, chat, reasoning
    • Examples: GPT-4, Claude, Llama, Mistral, Gemini
  • Encoder-Decoder (T5/Seq2Seq family)
    • Encoder reads input, decoder generates output
    • Best for: translation, summarization with source
    • Examples: T5, FLAN-T5, BART, mBART

7.2 Scaling Laws

  • Chinchilla Scaling Laws (Hoffmann et al., 2022)
    • Optimal: training tokens ≈ 20 × number of parameters
    • Compute-optimal N and D both scale roughly as C^0.5
    • Training compute: C ≈ 6ND FLOPs (N params, D tokens) - see the sketch after this list
    • GPT-3 was undertrained; Chinchilla used the same compute on more tokens
  • OpenAI Scaling Laws (Kaplan et al., 2020)
    • Loss scales as a power law with compute, data, and parameters
    • L(N) ~ N^{-0.076}; L(D) ~ D^{-0.095}
  • Emergent Abilities - appear suddenly at certain scales
    • In-context learning (~1B+), chain-of-thought (~100B+)
  • Neural Scaling Laws for LLM Inference
    • Larger model + fewer inference steps > smaller model + more steps
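
A back-of-envelope sketch of this budgeting, assuming C ≈ 6ND and the ~20 tokens-per-parameter rule:

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    # C = 6*N*D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.8e23 FLOP budget (roughly Chinchilla-scale)
n, d = chinchilla_split(5.8e23)
print(f"~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")   # ~70B params, ~1.4T tokens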

7.3 Context Window & Memory

  • Context window - maximum tokens the model can process
  • KV Cache - cache Key and Value tensors during generation
    • Memory: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element (see the sketch after this list)
    • For Llama-3-70B at 100K context: ~30GB just for the KV cache
  • KV Cache Compression
    • StreamingLLM - sink tokens + recent window
    • SnapKV - select important KV pairs
    • MLA (Multi-head Latent Attention) - DeepSeek's innovation
  • Positional interpolation - extend context beyond training length
  • Infini-attention - compressive memory for infinite context
  • Mamba/SSM - linear recurrence, O(1) memory per step
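
A quick sanity check of the KV cache formula above (a sketch; the Llama-3-70B-style numbers of 80 layers, 8 KV heads, and head_dim 128 are assumed for illustration):

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # Factor of 2 covers storing both K and V
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=100_000) / 1e9
print(f"{gb:.0f} GB")   # ~33 GB for one 100K-token sequence in BF16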

7.4 Tokenizer Design Details

  • Vocabulary size tradeoffs
    • Larger vocab: shorter sequences, faster, but larger embedding table
    • Smaller vocab: longer sequences, slower, smaller model
    • Typical: 32K (Llama 2), 128K (Llama 3, GPT-4), 256K (Gemini)
  • Special tokens: [BOS], [EOS], [PAD], [UNK], [MASK]
  • Chat templates: system/user/assistant turn formatting
    • Llama: <|begin_of_text|><|start_header_id|>system<|end_header_id|>...
    • ChatML: <|im_start|>system\n...<|im_end|>
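
For concreteness, a sketch of applying a chat template with the HuggingFace transformers API (the model name is only an example and may require access approval):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV caching in one sentence."},
]
# Renders the special tokens and turn headers described above
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)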

7.5 Mixture of Experts (MoE)

  • Replace dense FFN with N expert FFNs + router
  • Top-K Routing: only K experts activated per token (K=1 or 2)
  • Load Balancing Loss: encourage equal use of all experts
  • Sparse MoE: Mixtral 8x7B, 8x22B - 8 experts, 2 active
  • Fine-grained MoE: DeepSeek-V3 - 256 experts, 8 active
  • Expert Choice routing - experts choose tokens (better balance)
  • Advantages: massive parameter count, same compute cost

8. Training Pipeline - From Scratch to Advanced

8.1 Data Collection & Curation

Sources

  • Common Crawl - petabyte-scale web crawl (CC-Main, CC-News)
  • The Pile - EleutherAI's 825GB diverse dataset
  • RedPajama - open reproduction of LLaMA training data
  • ROOTS - multilingual BLOOM training data
  • Books: Project Gutenberg, Books3, BookCorpus
  • Code: GitHub (The Stack, StarCoder data), code contests
  • Scientific Papers: arXiv, PubMed, S2ORC
  • Wikipedia/Wikidata - high-quality factual text
  • StackExchange, Reddit - Q&A, discussion
  • Multilingual: CC-100, mC4, CulturaX

Data Processing Pipeline

Raw HTML/Text
    ↓
URL/Domain Filtering (dedup, quality domains)
    ↓
Language Identification (fastText, langdetect)
    ↓
Quality Filtering:
  - Perplexity filter (KenLM)
  - Heuristics (short docs, repetition ratio, symbol ratio)
  - ML classifiers (CCNet, Gopher quality filters)
    ↓
Deduplication:
  - Exact: MD5/SHA256 hashing
  - Near-duplicate: MinHash LSH (SimHash)
  - Semantic: embedding-based dedup
    ↓
PII Removal (emails, phone numbers, SSNs)
    ↓
Tokenization & Packing
    ↓
Binary format (numpy memmap, HDF5, WebDataset)

Data Mixing & Weighting

  • Domain weighting: upweight high-quality sources
  • Data mixing ratios (e.g., 80% web, 10% code, 5% books, 5% science)
  • Data flywheels: use trained model to filter better data
  • DSIR (Data Selection via Importance Resampling) - target-aware
  • DoReMi - automatic domain weight optimization

8.2 Model Architecture Configuration

Hyperparameter Selection Table

Model Size | d_model | n_layers | n_heads | d_ff    | Params
-----------|---------|----------|---------|---------|--------
125M       | 768     | 12       | 12      | 3072    | ~125M
1.3B       | 2048    | 24       | 16      | 8192    | ~1.3B
7B         | 4096    | 32       | 32      | 11008   | ~7B
13B        | 5120    | 40       | 40      | 13824   | ~13B
30B        | 6656    | 60       | 52      | 17920   | ~30B
70B        | 8192    | 80       | 64      | 28672   | ~70B
175B(GPT3) | 12288   | 96       | 96      | 49152   | ~175B
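
As a rough cross-check of the table, a parameter-count sketch for a dense decoder-only model (ignores biases, norms, and GQA; assumes tied embeddings):

def count_params(vocab, d_model, n_layers, d_ff, swiglu=True):
    embed = vocab * d_model                          # tied input/output embedding
    attn = 4 * d_model * d_model                     # W_Q, W_K, W_V, W_O
    ffn = (3 if swiglu else 2) * d_model * d_ff      # up/gate/down vs. up/down
    return embed + n_layers * (attn + ffn)

# The "7B" row above (32K vocab assumed)
print(f"{count_params(32_000, 4096, 32, 11008) / 1e9:.2f}B")
# ~6.6B; real 7B models add untied embeddings, norms, etc.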

8.3 Pre-Training

Causal Language Modeling Objective

L_CLM = -Σ log P(x_t | x_1, ..., x_{t-1})

For each sequence: predict next token given all previous tokens
Cross-entropy loss averaged over all positions

Training Configuration

  • Batch size: typically 256–4096 sequences
  • Sequence length: 2048–8192 tokens per sequence
  • Global batch size: micro_batch × grad_accum × world_size
  • Learning rate schedule:
    • Linear warmup (1000–2000 steps)
    • Cosine decay to lr_min = 0.1 × lr_max
    • lr_max typically 1e-4 to 3e-4
  • Weight decay: 0.1 (AdamW standard)
  • Gradient clipping: clip at 1.0 norm
  • β1=0.9, β2=0.95, ε=1e-8 (Adam hyperparameters for LLMs)

Training Stability Techniques

  • Loss spikes: reduce lr, check data quality at spike step
  • Gradient norm monitoring: track throughout training
  • Loss divergence recovery: reload checkpoint, skip data batch
  • Z-loss regularization: penalize large logit magnitudes
  • QK Norm: normalize Q and K before attention score computation
  • Checkpoint averaging: average last N checkpoints for stability

8.4 Distributed Training

Data Parallelism (DP)

  • Replicate model on each GPU
  • Each GPU processes different batch
  • Synchronize gradients via AllReduce after backward
  • DDP (PyTorch), Horovod
  • FSDP (Fully Sharded Data Parallel) - ZeRO Stage 3

ZeRO (Zero Redundancy Optimizer) - DeepSpeed

Stage 0: Baseline DDP (model replicated)
Stage 1: Shard optimizer states across GPUs
Stage 2: Shard optimizer states + gradients
Stage 3: Shard optimizer states + gradients + parameters
         (full model sharding - needed for 70B+ on 8 GPUs)
ZeRO-Infinity: offload to CPU/NVMe for extreme scale

Tensor Parallelism (TP) - Megatron-LM

  • Split individual weight matrices across GPUs
  • Column parallel: split W along output dimension
  • Row parallel: split W along input dimension
  • Requires AllReduce at each forward/backward
  • Best for very large layers (d_model = 8192+)

Pipeline Parallelism (PP)

  • Assign layers to different GPUs/nodes
  • GPipe: micro-batches flow through pipeline
  • PipeDream: 1F1B (one forward, one backward) schedule
  • Bubble overhead: (p-1)/(m+p-1) for p stages, m micro-batches
  • Interleaved pipeline: reduces bubble, increases memory

Sequence Parallelism (SP)

  • Distribute long sequence across devices
  • Each device handles chunk of sequence length
  • Ring Attention: pass KV around ring of devices
  • Useful for 100K+ context training

3D Parallelism (Megatron-DeepSpeed)

  • Combine DP + TP + PP for training 100B+ models
  • Example: 175B on 1024 GPUs: DP=8, TP=8, PP=16

8.5 Mixed Precision Training

  • FP32 - full precision, safe but 2× memory vs FP16
  • FP16 - 5-bit exponent, 10-bit mantissa, can overflow
  • BF16 - 8-bit exponent, 7-bit mantissa, same range as FP32
    • Preferred for LLM training (no loss scaling needed)
  • AMP (Automatic Mixed Precision):
    • Keep master weights in FP32
    • Forward/backward in FP16/BF16
    • Update master FP32 weights
  • FP8 Training - H100 native, needs careful scaling
    • Transformer Engine (NVIDIA) handles FP8 automatically

8.6 Checkpointing & Recovery

  • Save every N steps (N = 500–2000 typically)
  • Checkpoint includes: model weights, optimizer states, scheduler state, RNG state
  • Activation Checkpointing - recompute activations during backward to save memory
    • Trade ~33% extra compute for ~10× memory savings
  • Selective Activation Checkpointing - checkpoint only expensive ops
  • Distributed checkpoint sharding (each rank saves its own shard)

9. RLHF, Alignment & Fine-Tuning

9.1 Supervised Fine-Tuning (SFT)

  • Collect instruction-response pairs (hundreds of thousands)
  • Data formats:
    • Alpaca format: instruction/input/output
    • ShareGPT format: multi-turn conversations
    • FLAN/T0: task-specific instruction templates
  • Fine-tune with teacher forcing on the completions only
  • Mask the loss on prompt tokens; compute it only on the response (see the sketch after this list)
  • Key datasets: OpenAssistant, Dolly, FLAN, WizardLM, UltraChat
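
A minimal sketch of that loss masking, using PyTorch's ignore_index convention:

import torch
import torch.nn.functional as F

def sft_labels(input_ids, prompt_len):
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100     # positions with label -100 contribute no loss
    return labels

def sft_loss(logits, labels):
    # Standard next-token shift: predict token t+1 from the hidden state at position t
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )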

9.2 RLHF Pipeline (Reinforcement Learning from Human Feedback)

Step 1: SFT Model (instruction-following base)
    ↓
Step 2: Preference Data Collection
  Human annotators compare 2+ model outputs
  Rank: A > B or A = B
  Collect ~50K–1M comparisons
    ↓
Step 3: Reward Model Training
  Bradley-Terry model:
  L_RM = -E[log σ(r(x, y_w) - r(x, y_l))]
  where y_w = preferred response, y_l = rejected
    ↓
Step 4: PPO Training
  Maximize: E[r(x, y)] - β * KL(π_θ || π_ref)
  KL penalty prevents the model from deviating too far from SFT

PPO (Proximal Policy Optimization) Details

L_CLIP = E[min(r_t(θ) * A_t, clip(r_t(θ), 1-ε, 1+ε) * A_t)]

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)  (probability ratio)
A_t = advantage estimate (GAE: Generalized Advantage Estimation)
ε = 0.2 (clipping parameter)

Value function:
L_VF = E[(V_θ(s_t) - V_target)²]

Total loss:
L = L_CLIP - c1 * L_VF + c2 * S[π_θ](s_t)  (entropy bonus)

9.3 DPO (Direct Preference Optimization) - Simpler RLHF

L_DPO = -E[log σ(β * (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]

Advantages over RLHF-PPO:
- No separate reward model needed
- More stable training
- Simpler implementation
- Comparable or better results
  • Variants: IPO, KTO, ORPO, SimPO, CPO
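
The DPO objective above, as a minimal PyTorch sketch given the summed log-probabilities of each response under the policy and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()   # -log sigmoid, averaged over the batch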

9.4 Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation)

W = W_0 + ΔW = W_0 + BA

Where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)
Typically r = 4, 8, 16, 32, 64

Number of trainable params: r*(d+k) vs d*k
Reduction factor: (r*(d+k)) / (d*k)
Example: 4096×4096 layer, r=16: 16*(4096+4096) ≈ 131K trainable vs 16.8M (~99.2% reduction)
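
A minimal LoRA layer sketch: the base weight is frozen and only the low-rank update B @ A is trained (alpha/r scaling as in the original paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze W_0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # B = 0 so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
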
  • QLoRA - quantize the base model to 4-bit, train LoRA adapters
    • NF4 quantization (Normal Float 4-bit)
    • Double quantization: quantize the quantization constants too
    • Paged Optimizers: offload optimizer states to CPU RAM
  • LoRA+ - different learning rates for the A and B matrices
  • DoRA - decompose into magnitude + direction components
  • LoRA-FA - frozen A matrix, only train B
  • AdaLoRA - adaptive rank allocation per layer
  • PiSSA - principal singular values and singular vectors
  • Prefix Tuning - trainable prefix tokens prepended to each layer
  • P-Tuning v2 - deep prompt tuning
  • IA³ - rescale activations with learned vectors (<0.1% params)

9.5 Constitutional AI (Anthropic's Claude Approach)

  • Define principles/constitution for model behavior
  • CAI-SL: fine-tune on critiques and revisions following constitution
  • CAI-RL: use AI feedback instead of human (RLAIF)
    • Generate responses → evaluate with the constitution → rank → RL
  • Red-teaming: adversarial probing for harmful outputs
  • Harmlessness + Helpfulness + Honesty triad

9.6 Model Merging

  • SLERP - spherical linear interpolation of model weights
  • TIES-Merging - trim + elect sign + disjoint merge
  • DARE - random pruning before merging (sparse delta weights)
  • Model Soup - average fine-tuned models (Wortsman et al.)
  • Mergekit library for practical model merging

10. Major Algorithms & Techniques Reference

10.1 Generation Algorithms

  • Greedy Decoding - always pick the highest-probability token
  • Beam Search - maintain the top-B hypotheses at each step
  • Sampling - sample from the probability distribution
  • Temperature Scaling - T < 1 = sharper; T > 1 = softer
    • P'(x) = softmax(logits / T)
  • Top-K Sampling - sample from the top K tokens only
  • Top-P (Nucleus) Sampling - sample from tokens whose probabilities sum to P (see the sketch after this list)
  • Min-P Sampling - minimum probability threshold relative to the top token
  • Typical Sampling - sample tokens of typical information content
  • Contrastive Search - maximize (1-α)*p(x) - α*max_j cos_sim(h_x, h_xj)
  • Speculative Decoding - a small draft model proposes, the large model verifies
    • ~2-4× speedup with no quality loss
  • Medusa - parallel draft heads on a single model
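
A sketch of temperature, top-k, and top-p sampling from raw logits (illustrative; inference engines implement fused versions of this):

import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    logits = logits / temperature
    # Top-k: keep only the k highest logits
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    # Top-p: drop the tail once cumulative probability exceeds p (always keep the top token)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)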

10.2 Inference Optimization

  • KV Cache - store past key-value pairs, O(1) per new token
  • Continuous Batching - dynamic batching, no waiting for sequences to finish
  • PagedAttention - virtual memory for the KV cache
  • Quantization:
    • PTQ (Post-Training Quantization): GPTQ, AWQ, SmoothQuant
    • QAT (Quantization-Aware Training): fp8, int8 training
    • W4A16: 4-bit weights, 16-bit activations (most common)
    • GGUF format: llama.cpp quantization (Q4_K_M, Q5_K_S, etc.)
  • Weight Sharing/Tying - tie embedding and output projection
  • Knowledge Distillation - small student learns from a large teacher
    • Response distillation: match output distributions
    • Feature distillation: match intermediate representations
  • Pruning:
    • Magnitude pruning: remove small-weight connections
    • Structured pruning: remove entire heads/layers
    • SparseGPT: one-shot unstructured pruning for GPT models

10.3 Long Context Techniques

  • Sliding Window Attention - Mistral's local attention
  • LongFormer - local + global attention tokens
  • BigBird - random + local + global attention
  • ALiBi - linear bias enables zero-shot length generalization
  • RoPE scaling variants:
    • Linear interpolation (position_id / scale_factor)
    • NTK-aware interpolation
    • YaRN (Yet Another RoPE Extension)
    • LongRoPE (progressive rescaling)
  • Retrieval Augmented Generation (RAG):
    • Dense retrieval (DPR, E5, BGE embeddings)
    • Sparse retrieval (BM25)
    • Hybrid retrieval
    • Reranking (ColBERT, cross-encoder)

10.4 Reasoning & Chain of Thought

  • Chain-of-Thought (CoT) prompting - "Let's think step by step"
  • Self-Consistency - sample multiple CoT paths, majority vote
  • Tree of Thought (ToT) - tree search over reasoning steps
  • Graph of Thought - arbitrary DAG of reasoning
  • Program-Aided Language Models (PAL) - generate executable code
  • ReAct - interleaved reasoning and action (tool use)
  • Process Reward Models (PRM) - reward each reasoning step
  • Outcome Reward Models (ORM) - reward the final answer only
  • MCTS for LLM reasoning - Monte Carlo Tree Search guided by a PRM
  • o1/o3-style reasoning - long chain-of-thought with test-time compute scaling

11. Tools, Frameworks & Libraries

11.1 Deep Learning Frameworks

Framework    | Use Case               | Key Feature
-------------|------------------------|-----------------------------
PyTorch      | Research & production  | Dynamic graphs, pythonic
JAX          | Google TPU training    | XLA compilation, functional
TensorFlow   | Production deployment  | TF Serving, TFLite
MXNet        | AWS ecosystem          | Gluon API
PaddlePaddle | Baidu ecosystem        | Chinese NLP focus

11.2 LLM Training Frameworks

Framework     | Organization | Best For
--------------|--------------|----------------------------------
Megatron-LM   | NVIDIA       | Large-scale 3D parallel training
DeepSpeed     | Microsoft    | ZeRO optimization, ZeRO-Infinity
FSDP          | Meta/PyTorch | Simpler full sharding
Colossal-AI   | HPC-AI Tech  | Heterogeneous training
Alpa          | UCB/Google   | Auto-parallelism
LLaMA-Factory | Community    | Fine-tuning factory
Axolotl       | OpenAccess   | YAML-configured fine-tuning
TRL           | HuggingFace  | RLHF/DPO training
OpenRLHF      | OpenLLMAI    | Scalable RLHF
Nanotron      | HuggingFace  | Lightweight pre-training

11.3 Inference Frameworks

Framework                       | Focus                 | Key Feature
--------------------------------|-----------------------|-------------------------------------
vLLM                            | High throughput       | PagedAttention, continuous batching
TGI (Text Generation Inference) | HuggingFace           | Production API
llama.cpp                       | Local/edge            | CPU inference, GGUF
Ollama                          | Local deployment      | Easy model management
TensorRT-LLM                    | NVIDIA GPU            | TensorRT optimized kernels
MLC-LLM                         | Multi-platform        | Web, mobile, server
ExLlamaV2                       | Consumer GPU          | GPTQ inference
CTransformers                   | Python bindings       | llama.cpp Python
LightLLM                        | Triton kernels        | FlashAttention2
SGLang                          | Structured generation | RadixAttention

11.4 Model Hub & Ecosystem

  • HuggingFace Hub - 500K+ models, datasets, spaces
    • Transformers library: universal model API
    • Datasets library: 50K+ datasets
    • PEFT library: LoRA, prefix tuning, etc.
    • Accelerate: multi-GPU/TPU training
    • Tokenizers: fast Rust tokenizers
  • PyTorch Hub - model repository
  • Weights & Biases (WandB) - experiment tracking
  • MLflow - experiment tracking + model registry
  • DVC - data version control
  • LangChain - LLM application framework
  • LlamaIndex - RAG and data indexing
  • Haystack - NLP pipeline framework

11.5 Data Processing

  • Apache Spark - distributed data processing
  • Ray - distributed Python, Ray Data
  • DataTrove - HuggingFace data processing pipeline
  • The Stack deduplication - MinHash LSH at scale
  • SentenceTransformers - embedding models
  • FAISS - fast ANN search for vectors
  • Elasticsearch - BM25 + vector search

11.6 Evaluation Frameworks

Benchmark             | Tests                        | Size
----------------------|------------------------------|----------------
MMLU                  | World knowledge, 57 subjects | 14K questions
HellaSwag             | Commonsense reasoning        | 70K examples
HumanEval             | Code generation              | 164 problems
MBPP                  | Python programming           | 500 problems
GSM8K                 | Grade school math            | 8.5K problems
MATH                  | Competition math             | 12.5K problems
ARC-Challenge         | Science QA                   | 1.2K questions
TruthfulQA            | Factual accuracy             | 817 questions
BIG-Bench             | Diverse reasoning            | 204 tasks
MT-Bench              | Chat multi-turn              | 80 questions
Chatbot Arena         | Human preference             | 1M+ votes
lm-evaluation-harness | EleutherAI eval framework    | All benchmarks

12. Hardware Requirements by Model Type

12.1 GPU Reference Table

Consumer GPUs

GPU      | VRAM | Memory BW | TFLOPs (BF16) | Best Use
---------|------|-----------|---------------|--------------------------
RTX 3080 | 10GB | 760 GB/s  | 29.8          | Inference ≤7B
RTX 3090 | 24GB | 936 GB/s  | 35.6          | Inference ≤13B
RTX 4080 | 16GB | 736 GB/s  | 48.7          | Inference ≤13B
RTX 4090 | 24GB | 1008 GB/s | 82.6          | Inference/fine-tune ≤33B
RTX 5090 | 32GB | 1792 GB/s | 209           | Inference/fine-tune ≤70B

Data Center GPUs

GPU       | VRAM  | Memory BW | TFLOPs (BF16) | NVLink | Best Use
----------|-------|-----------|---------------|--------|----------------
A100 40GB | 40GB  | 1.6 TB/s  | 312           | Yes    | Training ≤13B
A100 80GB | 80GB  | 2.0 TB/s  | 312           | Yes    | Training ≤30B
H100 SXM  | 80GB  | 3.35 TB/s | 989           | Yes    | Training ≤70B
H100 NVL  | 94GB  | 3.9 TB/s  | 1979          | Yes    | Large models
H200      | 141GB | 4.8 TB/s  | 1979          | Yes    | 70B+ training
B200      | 192GB | 8 TB/s    | ~4500         | Yes    | Frontier models

Multi-GPU Requirements for Training

Model Size | FP16/BF16 Memory | 8×A100 40GB | 8×A100 80GB | 8×H100
-----------|------------------|-------------|-------------|--------
7B params  | ~14GB (weights)  | Yes         | Yes         | Yes
13B        | ~26GB            | With ZeRO3  | Yes         | Yes
30B        | ~60GB            | With ZeRO3  | With ZeRO3  | Yes
70B        | ~140GB           | No          | With ZeRO3  | Yes
175B       | ~350GB           | No          | 4 nodes     | 2 nodes
405B       | ~810GB           | No          | No          | 4+ nodes

12.2 Memory Math

Model Parameters Memory (bytes):
- FP32: 4 bytes/param
- BF16/FP16: 2 bytes/param
- INT8: 1 byte/param
- INT4/NF4: 0.5 bytes/param

Training Memory = model + gradients + optimizer states
  - SGD: model + gradients = 2× model
  - Adam/AdamW: model + gradients + 2× optimizer = 4× model
  - AMP: model(fp16) + model(fp32 master) + gradients + optimizer
         = 2 + 4 + 2 + 8 = 16 bytes/param

Inference Memory = model + KV_cache + activations
  KV cache = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes

Example: 7B model training with AdamW in BF16:
  7B × 16 bytes = 112GB → needs 2× A100 80GB with ZeRO
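
The 16 bytes/param rule above as a tiny sketch, including the effect of sharding states across GPUs with ZeRO-3:

def training_memory_per_gpu_gb(n_params, n_gpus=1, bytes_per_param=16):
    # params + grads + Adam states; ZeRO-3 shards all of them across GPUs
    return n_params * bytes_per_param / n_gpus / 1e9

print(training_memory_per_gpu_gb(7e9))             # ~112 GB on one GPU
print(training_memory_per_gpu_gb(7e9, n_gpus=8))   # ~14 GB/GPU with ZeRO-3 (activations extra)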

12.3 TPU (Google)

  • TPU v4: 275 TFLOPS BF16, 32GB HBM, 600 GB/s
  • TPU v5e: purpose-built inference, 4× efficiency vs v4
  • TPU v5p: training powerhouse, 459 TFLOPS
  • Cloud TPU Pods: 4096 chips interconnected (exaFLOP scale)
  • Native JAX/XLA support; PyTorch via torch_xla

12.4 Infrastructure

  • Networking: InfiniBand (400 Gb/s) between nodes, NVLink within node
  • Storage: Lustre parallel filesystem, AWS FSx, GCS
    • Read bandwidth: 1-100 GB/s for efficient data loading
  • CPU: AMD EPYC/Intel Xeon for data preprocessing
  • RAM: 512GB–2TB per node for large batch, ZeRO-Infinity offload
  • NVMe: 30+ TB fast local storage for checkpoint/cache
  • Power: H100 SXM: 700W; full node (8× H100): ~10kW

13. Architecture Designs - Working Principles

13.1 Complete Transformer Forward Pass (Decoder-Only)

INPUT: Token sequence [t_1, t_2, ..., t_n]

STEP 1: Embedding
  x = Embedding(tokens) + PositionalEncoding(positions)
  x ∈ R^(n × d_model)

STEP 2: For each of L transformer layers:

  a) Layer Norm (Pre-Norm)
     x_norm = LN(x)   or   RMSNorm(x)
  
  b) Causal Self-Attention
     Q = x_norm @ W_Q    K = x_norm @ W_K    V = x_norm @ W_V
     
     Apply RoPE to Q and K:
     Q, K = apply_rotary_embedding(Q, K, positions)
     
     Split into h attention heads
     
     For each head i:
        A_i = softmax((Q_i @ K_i^T) / √d_k + causal_mask) @ V_i
     
     Concatenate: A = [A_1; A_2; ...; A_h]
     Output: Attn_out = A @ W_O
     
     Residual: x = x + Attn_out
  
  c) Layer Norm (Pre-Norm again)
     x_norm2 = LN(x)   or   RMSNorm(x)
  
  d) Feed-Forward Network (SwiGLU)
     gate = SiLU(x_norm2 @ W_gate)
     up   = x_norm2 @ W_up
     FFN  = (gate * up) @ W_down
     
     Residual: x = x + FFN

STEP 3: Final Layer Norm
  x = RMSNorm(x)

STEP 4: Language Model Head (Linear projection + Softmax)
  logits = x @ W_lm_head     (or use tied embedding weights)
  probs  = softmax(logits / temperature)

STEP 5: Sample/select next token
  next_token = sample(probs)  or  argmax(probs)

13.2 GPT Architecture (Decoder-Only)

  • Unidirectional attention (causal mask)
  • Predict next token: P(x_t | x_1...x_{t-1})
  • GPT-1: 12 layers, 768 d_model, 117M params
  • GPT-2: 48 layers, 1600 d_model, 1.5B params
  • GPT-3: 96 layers, 12288 d_model, 175B params
  • GPT-4: MoE, ~8×220B, estimated ~1.8T params (unconfirmed)
  • Training: Causal LM, web text
  • No official architecture paper for GPT-4

13.3 Llama Architecture Details

  • Llama 1/2: RoPE, RMSNorm, SwiGLU FFN, GQA (Llama 2)
  • Llama 3: 128K vocab, GQA, 128K context
  • Architecture difference from GPT:
    • No biases in linear layers
    • RMSNorm instead of LayerNorm (no mean subtraction)
    • RoPE instead of absolute PE
    • SwiGLU with 3 matrices (up, gate, down) instead of 2
    • GQA: fewer KV heads than query heads

13.4 BERT Architecture (Encoder-Only)

  • Bidirectional attention (all tokens attend to all)
  • Pre-training objectives:
    • MLM: mask 15% tokens, predict them
    • NSP: predict if sentence B follows sentence A
  • Fine-tune on downstream tasks with task head
  • CLS token embedding → classification

13.5 T5 Architecture (Encoder-Decoder)

  • Encoder: bidirectional full attention
  • Decoder: causal self-attention + cross-attention to encoder
  • All tasks as text-to-text: "Translate: ..." → "..."
  • Relative positional biases instead of absolute PE

13.6 Mamba / SSM Architecture (Alternative to Transformer)

State Space Model Core:
h'(t) = A * h(t) + B * x(t)
y(t) = C * h(t) + D * x(t)

Discretized:
h_t = Ā * h_{t-1} + B̄ * x_t
y_t = C * h_t

Mamba adds: selective scan mechanism (S4 + selectivity)
Δ, B, C are input-dependent (unlike fixed SSMs); a minimal scan sketch follows the list below

Advantages:
- Linear O(N) training time
- O(1) memory per inference step
- Competitive with Transformers on long sequences
  • Mamba 2 - parallel scan, SSD (Structured State Space Duality)
  • Jamba - interleaved Mamba + Transformer layers (AI21)
  • Falcon Mamba - pure SSM language model
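
A minimal sequential scan sketch of the discretized recurrence above (Mamba makes Δ, B, C input-dependent and replaces this loop with a hardware-aware parallel scan):

import torch

def ssm_scan(x, A_bar, B_bar, C):
    # x: (T,) scalar input channel; A_bar: (N, N); B_bar: (N,); C: (N,)
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A_bar @ h + B_bar * x[t]     # h_t = Ā h_{t-1} + B̄ x_t
        ys.append(C @ h)                 # y_t = C h_t
    return torch.stack(ys)

y = ssm_scan(torch.randn(10), torch.eye(4) * 0.9, torch.ones(4), torch.ones(4))
print(y.shape)   # torch.Size([10])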

13.7 Mixture of Experts (MoE) Architecture

Expert Router:
  g(x) = Softmax(TopK(x @ W_router))
  
Each token routes to K experts:
  output = Σ g_k(x) * Expert_k(x)

Load Balancing:
  L_aux = α * Σ_i f_i * P_i
  f_i = fraction of tokens routed to expert i
  P_i = fraction of router probability on expert i
  • Mixtral 8×7B: 8 FFN experts, 2 active; effectively 12.9B active params (see the routing sketch after this list)
  • DeepSeek-V3: 256 experts, 8 active + 1 shared expert always active
  • Switch Transformer: top-1 routing, simpler but less expressive
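
A minimal top-2 routing sketch following the equations above (experts as small MLPs; the gate is a softmax over the selected experts' router logits):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)                 # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # gate over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=256)
print(moe(torch.randn(16, 64)).shape)   # torch.Size([16, 64])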

14. Complete Design & Development Process

14.1 Phase 0: Problem Definition & Scoping (Weeks 1-2)

Decision Framework:
├── Model Purpose
│   ├── General assistant (broad capability)
│   ├── Domain specialist (legal, medical, code)
│   ├── Multilingual (cover which languages?)
│   └── Multimodal (text + image + audio?)
├── Scale Decision
│   ├── <1B: edge deployment, specialized tasks
│   ├── 1-7B: consumer hardware, good balance
│   ├── 7-70B: server deployment, high capability
│   └── 70B+: frontier capability, data center
├── Compute Budget
│   ├── GPU-hours × cost → max tokens you can train
│   ├── Use Chinchilla formula for optimal allocation
│   └── Factor in inference cost at scale
└── Success Metrics
    ├── Benchmark targets (MMLU, HumanEval, etc.)
    ├── Latency requirements (tokens/sec)
    └── Cost per token at serving scale

14.2 Phase 1: Data Pipeline (Months 1-2)

Step 1: Data Acquisition
  - Download Common Crawl (use cc-net or datatrove)
  - Acquire books, code, scientific papers
  - License check everything

Step 2: Setup Processing Infrastructure
  pip install datatrove apache-beam
  # Spark cluster or Ray cluster for scale

Step 3: Implement Quality Pipeline
  quality_pipeline = [
    URLFilter(block_list=ADULT_DOMAINS),
    LanguageFilter(languages=["en"], min_prob=0.65),
    GopherQualityFilter(min_words=50, max_ratio_bullet_lines=0.9),
    C4QualityFilter(),
    ParagraphFilter(min_paragraphs=3),
  ]

Step 4: Deduplication
  minhash_dedup = MinHashDedup(
      n_shingles=5,
      n_buckets=14,
      n_hashes_per_bucket=8,
      threshold=0.7
  )

Step 5: Tokenize & Pack
  # BPE tokenizer training
  tokenizer = Tokenizer(BPE(unk_token="<unk>"))
  tokenizer.train(files=corpus_files, vocab_size=32000)
  
  # Pack sequences to max_length, add BOS/EOS
  # Use numpy memmap for efficient storage

14.3 Phase 2: Model Implementation (Month 2-3)

# Complete minimal Llama-style transformer implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8          # GQA
    max_seq_len: int = 4096
    ffn_dim: int = 14336
    rms_norm_eps: float = 1e-5

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        norm = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * norm * self.weight

def precompute_freqs(dim, max_len, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_len)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis

def apply_rotary_emb(xq, xk, freqs_cis):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[:xq.shape[1]].unsqueeze(0).unsqueeze(2)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads
        self.head_dim = config.d_model // config.n_heads
        self.n_rep = self.n_heads // self.n_kv_heads
        
        self.wq = nn.Linear(config.d_model, config.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(config.n_heads * self.head_dim, config.d_model, bias=False)
    
    def forward(self, x, freqs_cis, mask=None):
        B, T, _ = x.shape
        xq = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        xk = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        xv = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        
        xq, xk = apply_rotary_emb(xq, xk, freqs_cis)
        
        # Expand KV for GQA
        xk = xk.repeat_interleave(self.n_rep, dim=2)
        xv = xv.repeat_interleave(self.n_rep, dim=2)
        
        # Flash Attention via scaled_dot_product_attention
        xq = xq.transpose(1, 2)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)
        
        # Fused attention kernel; PyTorch forbids passing both an explicit mask and is_causal=True
        out = F.scaled_dot_product_attention(xq, xk, xv,
                                             attn_mask=mask,
                                             is_causal=mask is None)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)
        return self.wo(out)

class SwiGLU(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.w1 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
        self.w2 = nn.Linear(config.ffn_dim, config.d_model, bias=False)
        self.w3 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = Attention(config)
        self.feed_forward = SwiGLU(config)
        self.attention_norm = RMSNorm(config.d_model, config.rms_norm_eps)
        self.ffn_norm = RMSNorm(config.d_model, config.rms_norm_eps)
    
    def forward(self, x, freqs_cis, mask=None):
        x = x + self.attention(self.attention_norm(x), freqs_cis, mask)
        x = x + self.feed_forward(self.ffn_norm(x))
        return x

class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = nn.Embedding(config.vocab_size, config.d_model)
        self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
        self.norm = RMSNorm(config.d_model, config.rms_norm_eps)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        
        # Tie weights
        self.lm_head.weight = self.embeddings.weight
        
        # Precompute RoPE frequencies
        self.freqs_cis = precompute_freqs(config.d_model // config.n_heads, config.max_seq_len)
    
    def forward(self, tokens, targets=None):
        B, T = tokens.shape
        x = self.embeddings(tokens)
        freqs_cis = self.freqs_cis[:T].to(x.device)
        
        for layer in self.layers:
            x = layer(x, freqs_cis)
        
        x = self.norm(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        
        return logits, loss

14.4 Phase 3: Training Infrastructure (Month 3-4)

# Training loop with FSDP + gradient accumulation

import functools
import math

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def setup_training(config, model, train_dataset):
    # FSDP wrapping
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock}
    )
    model = FSDP(model, auto_wrap_policy=auto_wrap_policy,
                 mixed_precision=MixedPrecision(
                     param_dtype=torch.bfloat16,
                     reduce_dtype=torch.bfloat16,
                     buffer_dtype=torch.bfloat16,
                 ))
    
    # Optimizer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),
        eps=1e-8,
        weight_decay=0.1
    )
    
    # Scheduler: warmup + cosine decay (warmup_steps / total_steps assumed fields on config)
    warmup_steps, total_steps = config.warmup_steps, config.total_steps
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max(0.1, 0.5 * (1 + math.cos(math.pi * progress)))
    
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    
    return model, optimizer, scheduler

# Training step (scheduler and global step are passed in; the optimizer only
# steps every grad_accum_steps micro-batches)
def train_step(model, batch, optimizer, scheduler, step, grad_accum_steps):
    tokens, targets = batch
    
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        logits, loss = model(tokens, targets)
        loss = loss / grad_accum_steps
    
    loss.backward()
    
    if step % grad_accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    
    return loss.item() * grad_accum_steps

14.5 Phase 4: Evaluation & Iteration (Ongoing)

Evaluation Checkpoints:
  Every 1000 steps:
    - Validation loss (held-out data)
    - Perplexity on test set
  
  Every 5000 steps:
    - Run lm-evaluation-harness on core benchmarks
    - MMLU, HellaSwag, ARC, TruthfulQA
  
  After pre-training completes:
    - Full benchmark suite
    - Human evaluation samples
    - Red-teaming for safety issues

14.6 Phase 5: Post-Training (Month 5-6)

1. Collect SFT data:
   - Buy/license instruction datasets
   - Use GPT-4 to generate synthetic data
   - Human annotators for quality examples
   
2. Fine-tune with TRL/Axolotl:
   accelerate launch train_sft.py \
     --model_name_or_path base_model/ \
     --dataset_name sft_data \
     --max_seq_length 4096 \
     --num_train_epochs 3 \
     --per_device_train_batch_size 4 \
     --gradient_accumulation_steps 4

3. Collect preference data:
   - Sample 2+ outputs for each prompt
   - Human annotators rank outputs
   - Tools: LabelStudio, Argilla, Scale AI

4. Train reward model:
   python train_reward_model.py \
     --model sft_model/ \
     --data preference_data.json
   
5. RLHF/DPO:
   python train_dpo.py \
     --model sft_model/ \
     --reward_model rm_model/ \
     --beta 0.1

15. Reverse Engineering Existing LLMs

15.1 Approach & Methodology

Reverse engineering modern LLMs means studying their papers, open implementations, and behavioral analysis to understand design decisions.

15.2 Reverse Engineering GPT-4 (What We Know)

  • Architecture (from papers + leaks):
    • Mixture of Experts: ~8 experts, 2 active per token
    • Estimated 1.8T total params, ~200B active per forward pass
    • ~120 transformer layers
    • Context: 128K tokens (GPT-4 Turbo)
  • Training Data: ~13T tokens estimated
  • RLHF: Extensive human feedback + InstructGPT methodology
  • Safety: Constitutional AI-like red-teaming
  • Multimodal: CLIP-style vision encoder + projection

15.3 Reverse Engineering Llama 3.1 (Open Weights)

# Inspect Llama 3.1 70B architecture
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
print(config)

# Key findings:
# hidden_size: 8192
# intermediate_size: 28672
# num_attention_heads: 64
# num_key_value_heads: 8   ← GQA (8 KV heads vs 64 Q heads)
# num_hidden_layers: 80
# rope_theta: 500000.0
# vocab_size: 128256
# max_position_embeddings: 131072
# rms_norm_eps: 1e-05
# hidden_act: "silu"

15.4 Behavioral Reverse Engineering

Techniques:
1. Prompt probing - test specific capabilities systematically
2. Activation patching - identify which layers encode which info
   (requires white-box access or a similar open model)
3. Mechanistic interpretability:
   - Identify attention head functions (induction heads, copy heads)
   - Superposition hypothesis: polysemantic neurons
   - Sparse autoencoders to find features (Anthropic's SAE work)
4. Logit lens - project intermediate representations to the vocabulary
5. Activation analysis - t-SNE/UMAP of hidden states
6. Probing classifiers - train linear probes on hidden states

15.5 Studying Open Source LLMs

Key open models to study (in order of insight value):

1. GPT-2 (117M) - OpenAI, fully open, educational
   git clone https://github.com/openai/gpt-2

2. LLaMA 3 (8B-405B) - Meta, open weights + tokenizer details
   Excellent reference architecture

3. Mistral 7B - reference for sliding window + GQA

4. Falcon (1B-180B) - Technology Innovation Institute
   Original GQA + MQA reference

5. Pythia (70M-12B) - EleutherAI, training checkpoints available
   Study training dynamics over time

6. OLMo (7B) - Allen AI, truly open (code + data + checkpoints)
   Best for training process study

7. MosaicML MPT - HuggingFace-native architecture

Study approach:
- Read architecture paper
- Clone training codebase
- Trace forward pass manually
- Measure parameter counts per component
- Profile memory and compute requirements

16. Building Your Own LLM Service

16.1 Service Architecture Overview

                     ┌──────────────────────┐
                     │    Load Balancer     │
                     │   (nginx/Traefik)    │
                     └──────────┬───────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
 ┌────────▼───────┐    ┌────────▼───────┐    ┌────────▼───────┐
 │   API Server   │    │   API Server   │    │   API Server   │
 │   (FastAPI)    │    │   (FastAPI)    │    │   (FastAPI)    │
 └────────┬───────┘    └────────┬───────┘    └────────┬───────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                      ┌─────────▼─────────┐
                      │   Request Router  │
                      │  (Priority Queue) │
                      └─────────┬─────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
 ┌────────▼───────┐    ┌────────▼───────┐    ┌────────▼───────┐
 │ Inference Node │    │ Inference Node │    │ Inference Node │
 │   vLLM / TGI   │    │   vLLM / TGI   │    │   vLLM / TGI   │
 │    (4×H100)    │    │    (4×H100)    │    │    (4×H100)    │
 └────────┬───────┘    └────────────────┘    └────────────────┘
          │
 ┌────────▼──────────────────────────────┐
 │         Supporting Services           │
 │  Redis (cache) | PostgreSQL (logs)    │
 │  Prometheus (metrics) | Grafana (viz) │
 │  MinIO (model artifacts)              │
 └───────────────────────────────────────┘

16.2 API Layer Implementation

# FastAPI server for LLM service
import json
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

app = FastAPI(title="LLM API Service")

# Initialize vLLM engine
engine_args = AsyncEngineArgs(
    model="your_model_path",
    tensor_parallel_size=4,    # 4 GPUs
    gpu_memory_utilization=0.95,
    max_num_batched_tokens=32768,
    max_num_seqs=256,
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

class ChatRequest(BaseModel):
    messages: list[dict]
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = False

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    # Apply chat template (apply_chat_template is an app-level helper, not shown here)
    prompt = apply_chat_template(request.messages)
    
    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    
    if request.stream:
        return StreamingResponse(
            stream_generator(prompt, sampling_params),
            media_type="text/event-stream"
        )
    
    # Non-streaming: AsyncLLMEngine.generate yields partial results; keep the last one
    final_output = None
    async for result in engine.generate(prompt, sampling_params, request_id=str(uuid4())):
        final_output = result

    return format_openai_response(final_output)  # helper not shown: builds the OpenAI-style JSON

async def stream_generator(prompt, sampling_params):
    async for output in engine.generate(prompt, sampling_params, str(uuid4())):
        chunk = format_stream_chunk(output)
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

16.3 Deployment with Kubernetes

# k8s deployment for LLM inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - /models/llama-3-70b
          - --tensor-parallel-size
          - "4"
          - --max-num-batched-tokens
          - "32768"
          - --port
          - "8000"
        resources:
          limits:
            nvidia.com/gpu: 4
          requests:
            memory: "200Gi"
            cpu: "32"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      nodeSelector:
        nvidia.com/gpu.product: "H100-SXM-80GB"

16.4 Monitoring & Observability

# Prometheus metrics for LLM service
from prometheus_client import Counter, Histogram, Gauge

REQUEST_COUNT = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
REQUEST_LATENCY = Histogram('llm_request_latency_seconds', 
                             'Request latency', ['model'],
                             buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
TOKENS_GENERATED = Counter('llm_tokens_generated_total', 'Tokens generated', ['model'])
GPU_MEMORY_USED = Gauge('llm_gpu_memory_bytes', 'GPU memory used', ['gpu_id'])
QUEUE_SIZE = Gauge('llm_queue_size', 'Current queue depth')
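
To expose and update these metrics, one option is an ASGI /metrics endpoint plus a request middleware; a minimal sketch, assuming the FastAPI app from 16.2 (the label value and wiring are illustrative, and TOKENS_GENERATED / QUEUE_SIZE would be updated inside the generation path itself):

# Wiring the metrics into the FastAPI app (sketch)
import time
from prometheus_client import make_asgi_app

app.mount("/metrics", make_asgi_app())        # Prometheus scrapes this endpoint

MODEL_NAME = "llama-3-70b"                    # hypothetical label value

@app.middleware("http")
async def record_request_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    status = "ok" if response.status_code < 400 else "error"
    REQUEST_COUNT.labels(model=MODEL_NAME, status=status).inc()
    REQUEST_LATENCY.labels(model=MODEL_NAME).observe(time.perf_counter() - start)
    return response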

16.5 Cost Estimation

Infrastructure Cost Example (7B model, 100K daily users):
  
Serving: 2× 8×A100 nodes (AWS p4d.24xlarge)
  Cost: ~$32/hr/node × 2 = $64/hr = $1,536/day = $46K/month

Storage: 100TB (model, logs, cache)
  Cost: ~$2,300/month (S3)

Network: ~10TB outbound/month
  Cost: ~$900/month (at ~$0.09/GB egress)

Training (one-time, 7B model, Llama-style over-training on ~2T tokens
  rather than the Chinchilla-optimal ~140B):
  FLOPs ≈ 6 × 7e9 params × 2e12 tokens ≈ 8.4e22
  At 312 TFLOPS (A100 BF16) × ~40% MFU ≈ 125 TFLOPS effective per GPU
  → ~190K GPU-hours; budget ~200-300K with checkpointing and restarts
  Cost: ~$300K one-time (at ~$1-1.5/A100-hour) for a quality 7B model

Break-even: ~$0.001/1K tokens at scale
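
The training figures above follow the standard "FLOPs ≈ 6 × params × tokens" rule of thumb; a small calculator makes the assumptions (MFU, $/GPU-hour) explicit:

# Back-of-envelope pre-training cost via the ~6 * N * D FLOPs rule
def pretraining_cost(params, tokens, peak_tflops=312, mfu=0.4, dollars_per_gpu_hour=1.5):
    total_flops = 6 * params * tokens
    effective_flops = peak_tflops * 1e12 * mfu           # sustained FLOPs/s per GPU
    gpu_hours = total_flops / effective_flops / 3600
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

# 7B params on 2T tokens, A100 BF16 peak 312 TFLOPS, ~40% MFU, ~$1.5/GPU-hour
gpu_hours, cost = pretraining_cost(7e9, 2e12)
print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")      # ~187,000 GPU-hours, ~$280,000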

17. Cutting-Edge Developments

17.1 Test-Time Compute Scaling (2024-2025)

  • OpenAI o1/o3 – extended chain-of-thought reasoning
    • Models "think" for seconds to minutes before answering
    • Process Reward Models (PRMs) guide reasoning
    • MCTS/beam search over reasoning steps
    • Breakthrough on AIME math, competition programming
  • DeepSeek-R1 – open-source reasoning model
    • GRPO training (Group Relative Policy Optimization)
    • RL applied directly to reasoning with rule-based rewards, no PRM labeling
    • Matches o1 on many benchmarks at lower cost
  • Test-time compute scaling law: more inference compute → better results (see the best-of-N sketch below)
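
A toy illustration of the simplest form of test-time compute scaling, best-of-N sampling against a verifier score; sample_answer and score_answer are stand-in stubs, not any particular model or PRM:

# Best-of-N sampling at inference time (toy sketch with stub model and scorer)
import random

def sample_answer(question: str) -> str:
    # Stand-in for sampling one chain-of-thought + answer from an LLM
    return f"candidate {random.randint(0, 9)} for: {question}"

def score_answer(question: str, answer: str) -> float:
    # Stand-in for a process/outcome reward model or rule-based verifier
    return random.random()

def best_of_n(question: str, n: int = 16) -> str:
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: score_answer(question, a))

print(best_of_n("What is 17 * 24?", n=8))   # spending more samples (larger n) buys accuracy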

17.2 Multimodal LLMs

  • Architecture: Vision encoder → projector → LLM
    • CLIP/SigLIP → Linear/MLP projector → Decoder-only LLM (see the projector sketch after this list)
  • GPT-4V/GPT-4o: images, audio, text unified
  • Gemini 1.5 Pro: 1M context, native multimodal
  • LLaVA / LLaVA-NeXT: open multimodal models
  • Qwen-VL: image/video understanding
  • Video LLMs: VideoLLaMA, Video-LLaVA, Qwen2-VL
  • Any-to-Any: Unified IO, CoDi, NExT-GPT
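
A minimal PyTorch sketch of the "vision encoder → projector → LLM" pattern above; the dimensions are illustrative (e.g., ViT-L patch features of width 1024 projected into a 4096-dim LLM embedding space), and the two-layer MLP mirrors a LLaVA-1.5-style projector:

# Vision-to-LLM projector sketch: image patch features -> LLM-embedding-sized "visual tokens"
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from a frozen CLIP/SigLIP encoder
        return self.proj(patch_features)

projector = VisionProjector()
fake_patches = torch.randn(2, 576, 1024)      # e.g., 24x24 patches from ViT-L/14 at 336px
visual_tokens = projector(fake_patches)
print(visual_tokens.shape)                    # torch.Size([2, 576, 4096])
# These visual tokens are concatenated with text token embeddings and fed to the decoder-only LLM.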

17.3 Efficient Architecture Innovations

  • GQA (2023) – grouped query attention, now standard
  • Sliding Window + Full Attention Hybrid – Mistral approach
  • MLA (Multi-head Latent Attention) – DeepSeek-V2/V3
    • Low-rank KV compression: 93% KV cache reduction
    • Matches MHA quality with MQA efficiency
  • Differential Attention – Microsoft 2024
    • Cancels noise in attention via the difference of two softmax maps
  • Linear Attention / RetNet / RWKV / Mamba
    • Subquadratic alternatives to standard attention
  • TTT (Test-Time Training) – context as gradient descent

17.4 Training Innovations

  • Flash Attention 3 – hardware-aware kernels for H100, with FP8 support
  • FP8 Training – native 8-bit training on H100
  • Online RLHF – continuously update the RM with new data
  • RLAIF – AI feedback replacing human annotation
  • Constitutional AI 2.0 – multi-principle alignment
  • Direct Preference Optimization variants (IPO, KTO, ORPO)
  • Synthetic Data Generation – Phi series, Llama distillation
  • Curriculum Learning – easy→hard data ordering
  • Data Attribution – identify the most influential training examples

17.5 Inference & Serving Innovations

  • Speculative Decoding – 2-4× speedup with no quality loss (toy sketch after this list)
  • Medusa / EAGLE – parallel decoding heads
  • Continuous Batching – vLLM's signature feature
  • Chunked Prefill – interleave prefill and decode
  • Prefix Caching – reuse KV cache across requests
  • Quantization advances: GPTQ, AWQ, AQLM, FP8 inference
  • MoE routing optimization – expert parallelism
  • Disaggregated prefill/decode – separate servers for each phase
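
A toy sketch of the greedy variant of speculative decoding mentioned above: a cheap draft model proposes k tokens, the target model verifies them and keeps the longest agreeing prefix plus one corrected token. The two "models" here are arithmetic stand-ins, and real implementations verify all draft positions in a single batched forward pass:

# Greedy speculative decoding loop (toy sketch; draft_next_token / target_prediction are stubs)
def draft_next_token(tokens):
    return (sum(tokens) * 31 + 7) % 1000

def target_prediction(tokens):
    return (sum(tokens) * 131 + 13) % 1000

def speculative_decode(tokens, k=4, steps=8):
    tokens = list(tokens)
    for _ in range(steps):
        # 1) Draft k tokens autoregressively with the cheap model
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept draft tokens while they match the target model's own predictions
        accepted = 0
        for i in range(k):
            t = target_prediction(tokens + draft[:i])
            if t == draft[i]:
                accepted += 1
            else:
                tokens += draft[:accepted] + [t]     # agreeing prefix + target's correction
                break
        else:
            tokens += draft + [target_prediction(tokens + draft)]   # all accepted + bonus token
    return tokens

print(speculative_decode([1, 2, 3]))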

17.6 Long Context & Memory

  • Retrieval Augmented Generation 2.0
    • Self-RAG, FLARE, Adaptive RAG
    • Multi-hop reasoning over retrieved docs
  • Infinite context: StreamingLLM, MemGPT, Infini-Attention
  • Memory networks: Titans (2025), neural long-term memory
  • Long-context frontier: Gemini 1.5 (1M), Claude 3 (200K), Llama 3.1 (128K)
  • Persistent memory systems: vector databases + LLM

17.7 Agentic AI (2024-2025)

  • Tool use / Function calling – structured JSON outputs (dispatch-loop sketch after this list)
  • Code execution – Python interpreter as a tool
  • Browser agents – web navigation (Computer Use, WebAgent)
  • Multi-agent systems – AutoGen, CrewAI, LangGraph
  • Long-horizon planning – hierarchical task decomposition
  • World models – model-based reasoning about the environment
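
The dispatch loop behind tool use is small; a sketch with a stubbed model (call_model returns a canned JSON action rather than querying a real LLM, and get_weather is a hypothetical tool):

# Tool-use / function-calling dispatch loop (sketch with stubs)
import json

def get_weather(city: str) -> str:
    return f"22 C and sunny in {city}"                 # stand-in tool

TOOLS = {"get_weather": get_weather}

def call_model(messages):
    # Stand-in for an LLM that emits either a JSON tool call or a final answer
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "arguments": {"city": "Paris"}}
    return {"final": "It is 22 C and sunny in Paris right now."}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(5):                                 # cap the number of tool-use rounds
        action = call_model(messages)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["arguments"])
        messages.append({"role": "tool",
                         "content": json.dumps({"name": action["tool"], "result": result})})
    return "Stopped: too many tool calls."

print(run_agent("What's the weather in Paris?"))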

18. Build Ideas – Beginner to Advanced

🟢 Beginner Level (Months 1-6)

Project 1: GPT from Scratch (The Classic) Beginner

Goal: Build and train a character-level GPT
Skills: PyTorch basics, attention, training loop
Dataset: tiny_shakespeare.txt (~1MB)
Model: ~10K-100K parameters
Reference: Andrej Karpathy's "nanoGPT" tutorial

Project 2: Train a Tiny Tokenizer Beginner

Goal: Implement BPE tokenizer from scratch
Skills: String processing, Python
Dataset: Text corpus of your choice
Deliverable: Custom tokenizer matching tiktoken output
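
The core of the BPE training loop fits in a page: repeatedly merge the most frequent adjacent symbol pair until the merge budget is spent. A compressed educational sketch (a tiktoken-compatible tokenizer additionally needs byte-level pre-tokenization, regex splitting, and special tokens):

# Minimal byte-pair-encoding training loop (educational sketch, not tiktoken-compatible)
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    # Represent each word as a tuple of symbols plus an end-of-word marker
    words = Counter(tuple(w) + ("</w>",) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest newer newest wider", num_merges=10))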

Project 3: BERT Fine-Tuning for Classification Beginner

Goal: Fine-tune BERT for sentiment analysis
Skills: HuggingFace Transformers, fine-tuning
Dataset: SST-2, IMDB, or custom
Deliverable: 90%+ accuracy classifier with API

Project 4: Chatbot with LoRA Fine-Tuning Beginner

Goal: Fine-tune Llama 3.1 8B on custom instructions
Skills: PEFT, QLoRA, Axolotl
Dataset: 1K-10K instruction pairs
Hardware: 1× RTX 4090 or Colab A100
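
The fine-tuning setup for this project boils down to a handful of PEFT calls; a minimal QLoRA-style sketch (the model name, target modules, and hyperparameters below are illustrative defaults, not a tuned recipe):

# Minimal LoRA/QLoRA setup with HuggingFace PEFT (illustrative hyperparameters)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"     # assumed base model

bnb_config = BitsAndBytesConfig(                          # 4-bit base weights (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # typically well under 1% of parameters are trainable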

Project 5: RAG System Beginner

Goal: Build retrieval-augmented Q&A over documents
Skills: Embeddings, FAISS, LangChain
Components: PDF loader → chunker → embedder → retriever → LLM
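
The component chain maps almost one-to-one onto code; a minimal retrieval sketch with sentence-transformers + FAISS (the embedding model and toy "chunks" are assumptions, and the final LLM call is left out):

# Minimal RAG retrieval sketch: chunks -> embeddings -> FAISS index -> retrieve -> prompt
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "The KV cache stores keys and values for previously generated tokens.",
    "LoRA adds low-rank adapter matrices to frozen weight matrices.",
    "RoPE encodes positions by rotating query and key vectors.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
embeddings = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])            # inner product = cosine (normalized)
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 2):
    q = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

query = "How does LoRA work?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)                                             # this prompt would be sent to the LLM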

🟡 Intermediate Level (Months 6-18)

Project 6: Train a 125M Parameter LLM Intermediate

Goal: Pre-train GPT-2 sized model on domain data
Skills: Distributed training, data pipeline, evaluation
Dataset: 10-50B tokens (domain-specific)
Hardware: 4-8× A100 GPUs
Framework: Megatron-LM or custom PyTorch FSDP
Cost: ~$5K-20K compute

Project 7: Reward Model Training Intermediate

Goal: Train a reward model for RLHF
Skills: Preference data collection, Bradley-Terry model
Dataset: 50K+ comparison pairs
Deliverable: RM that scores responses 0-10
Evaluation: Accuracy on held-out comparisons
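
The objective behind this project is the Bradley-Terry pairwise loss: push the reward of the chosen response above the rejected one. A sketch of the loss with scalar rewards (a real RM produces them from a transformer backbone with a value head on the final token):

# Bradley-Terry pairwise loss for reward model training (sketch)
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch of preference pairs
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards for 4 preference pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5], requires_grad=True)
r_rejected = torch.tensor([0.1, 0.5, 1.0, -1.5])
loss = bradley_terry_loss(r_chosen, r_rejected)
loss.backward()
print(round(loss.item(), 4))
# Held-out accuracy = fraction of pairs where r_chosen > r_rejected.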

Project 8: Multimodal LLM (Vision + Text) Intermediate

Goal: Build LLaVA-style model
Architecture: CLIP ViT-L + projection MLP + Llama 3B
Training: 2-stage (align → instruction-tune)
Dataset: LLaVA-CC3M-Pretrain-595K + LLaVA-Instruct-150K
Skills: Multimodal data, vision encoder integration

Project 9: Production Inference Service Intermediate

Goal: Deploy your fine-tuned model as a production API
Components:
  - vLLM/TGI inference engine
  - FastAPI with streaming support
  - Redis for rate limiting + caching
  - Prometheus + Grafana monitoring
  - Docker Compose → Kubernetes migration
SLA: 99.9% uptime, <500ms p50 latency

Project 10: Code Generation Model Intermediate

Goal: Fine-tune or train a code-specialized LLM
Dataset: The Stack (languages you support)
Eval: HumanEval, MBPP, SWE-Bench
Features: FIM (fill-in-middle), multi-file context

🔴 Advanced Level (Months 18-36+)

Project 11: 7B Parameter Pre-training from Scratch Advanced

Goal: Train a competitive open-source 7B model
Budget: $200K-500K compute (reducible with data and efficiency optimizations)
Data: 1-2T tokens of curated web + books + code
Architecture: Llama 3-style (GQA, RoPE, SwiGLU, RMSNorm)
Training: 3D parallelism on 64-128× H100s
Evaluation: Competitive with Llama 3 8B on MMLU, HellaSwag

Project 12: Full RLHF Pipeline Advanced

Goal: Complete SFT → RM → PPO pipeline
SFT: 500K high-quality instruction examples
RM: 100K preference comparisons, 75%+ agreement accuracy
PPO: Stable training, no mode collapse
Deliverable: RLHF-tuned model preferred over SFT by humans
Tools: OpenRLHF or custom PPO implementation

Project 13: Reasoning Model (o1-style) Advanced

Goal: Build a reasoning model with extended CoT
Approach 1: MCTS + PRM training
Approach 2: GRPO like DeepSeek-R1
Dataset: Math (MATH, AMC, AIME) + code problems
Metric: AIME accuracy, competition math benchmarks
Novel contribution: Improved search algorithm or reward shaping

Project 14: MoE Language Model Advanced

Goal: Build Mixtral-style MoE model
Architecture: 8 experts, top-2 routing, 7B active params
Challenge: Load balancing, expert collapse prevention
Benefit: ~47B total params, only ~12.9B active params per token
Framework: Megablocks or custom CUDA kernel

Project 15: LLM Research Contribution Advanced

Goal: Novel research contribution publishable at ACL/NeurIPS/ICLR
Ideas:
  - New attention mechanism for long context
  - Better data selection algorithm
  - Novel PEFT method
  - Interpretability finding
  - New benchmark or evaluation methodology
  - Alignment technique
  - Efficient architecture variant
Process: Baseline → ablation → comparison → writeup → submission

19. Research Papers You Must Read

Foundational

  1. Attention Is All You Need (Vaswani et al., 2017) – The Transformer
  2. BERT (Devlin et al., 2018) – Bidirectional pre-training
  3. GPT-2 (Radford et al., 2019) – Language model pre-training
  4. GPT-3 (Brown et al., 2020) – Few-shot learners, scaling
  5. Scaling Laws for Neural LMs (Kaplan et al., 2020)

Architecture

  1. RoFormer/RoPE (Su et al., 2021) – Rotary position embedding
  2. ALiBi (Press et al., 2021) – Attention with linear biases
  3. GQA (Ainslie et al., 2023) – Grouped query attention
  4. FlashAttention (Dao et al., 2022) – IO-aware attention
  5. FlashAttention-2 (Dao, 2023)
  6. Mistral 7B (Jiang et al., 2023) – SWA + GQA
  7. Mixtral (Jiang et al., 2024) – Sparse MoE
  8. Mamba (Gu & Dao, 2023) – Linear-time sequence modeling
  9. LLaMA (Touvron et al., 2023) and LLaMA 2 & 3

Training & Optimization

  1. Chinchilla (Hoffmann et al., 2022) – Scaling laws revised
  2. PaLM (Chowdhery et al., 2022) – Large-scale language modeling
  3. Megatron-LM (Shoeybi et al., 2019) – Efficient large model training
  4. ZeRO (Rajbhandari et al., 2020) – Memory optimization
  5. AdamW (Loshchilov & Hutter, 2017) – Decoupled weight decay
  6. Lion Optimizer (Chen et al., 2023)

Alignment & RLHF

  1. InstructGPT (Ouyang et al., 2022) – RLHF for instruction following
  2. Constitutional AI (Bai et al., 2022) – Anthropic's alignment
  3. DPO (Rafailov et al., 2023) – Direct preference optimization
  4. RLHF (Christiano et al., 2017) – Original RLHF paper
  5. Self-Play Fine-Tuning (SPIN) (Chen et al., 2024)

Inference

  1. Speculative Decoding (Leviathan et al., 2022)
  2. vLLM / PagedAttention (Kwon et al., 2023)
  3. GPTQ (Frantar et al., 2022) – Post-training quantization
  4. AWQ (Lin et al., 2023) – Activation-aware quantization
  5. QLoRA (Dettmers et al., 2023) – Efficient fine-tuning

Reasoning & Capabilities

  1. Chain-of-Thought Prompting (Wei et al., 2022)
  2. Self-Consistency (Wang et al., 2022)
  3. Tree of Thoughts (Yao et al., 2023)
  4. ReAct (Yao et al., 2022) – Reasoning + acting
  5. DeepSeek-R1 (DeepSeek, 2025) – Open reasoning model

Recent (2024-2025)

  1. DeepSeek-V3 (2024) – Efficient large MoE
  2. Gemini 1.5 (2024) – 1M context
  3. Claude 3 Technical Report – Constitutional AI advances
  4. Llama 3 (Meta, 2024) – Technical report
  5. Titans (2025) – Neural long-term memory

20. Complete Learning Timeline

Phase 1: Foundations (Months 1-3)

Month 1: Math & Programming
  Week 1-2: Linear algebra (3Blue1Brown + Gilbert Strang MIT)
  Week 3-4: Calculus, probability, statistics (Khan Academy + Bishop PRML)

Month 2: ML & DL Basics
  Week 1-2: Classical ML (Andrew Ng Coursera)
  Week 3-4: Deep learning (fast.ai Part 1, or d2l.ai)

Month 3: NLP & Transformers
  Week 1-2: NLP fundamentals, word vectors
  Week 3-4: Transformer from scratch + HuggingFace ecosystem
  Project: Train character-level GPT on Shakespeare

Phase 2: LLM Fundamentals (Months 4-6)

Month 4: Transformer Internals
  - Read: "Attention Is All You Need", GPT-2 paper, BERT paper
  - Implement: Multi-head attention, RoPE, RMSNorm from scratch
  - Project: Fine-tune BERT on custom classification task

Month 5: Training at Scale
  - Study: Megatron-LM, DeepSpeed ZeRO, FSDP
  - Implement: Distributed training with FSDP on 2-4 GPUs
  - Project: Train 125M GPT on ~1B token dataset

Month 6: Fine-Tuning & Alignment
  - Study: LoRA, QLoRA, SFT, DPO papers
  - Implement: LoRA adapter, QLoRA training pipeline
  - Project: Fine-tune Llama 3 8B on instruction dataset with QLoRA

Phase 3: Intermediate Skills (Months 7-12)

Month 7-8: Data Pipeline Engineering
  - Web scraping at scale, datatrove
  - Deduplication with MinHash
  - Quality filtering pipeline
  - Project: Build 10B token domain corpus

Month 9-10: Production Serving
  - vLLM deployment, FastAPI, Docker, K8s
  - Monitoring, autoscaling, caching
  - Project: Deploy fine-tuned 7B model as production API

Month 11-12: Evaluation & Benchmarking
  - Run lm-evaluation-harness
  - Build custom eval suite
  - Understand benchmarks: MMLU, HumanEval, MT-Bench
  - Project: Comprehensive eval of your model vs. baselines

Phase 4: Advanced Training (Months 13-24)

Month 13-15: Pre-training from Scratch
  - Architect and implement 7B parameter model
  - Data pipeline: 500B-1T tokens
  - 3D parallel training on H100 cluster
  - Training stability, loss monitoring, recovery

Month 16-18: Full RLHF Pipeline
  - Preference data collection tools
  - Reward model training and evaluation
  - PPO or DPO training
  - Safety evaluation + red-teaming

Month 19-21: Advanced Topics
  - Mixture of Experts
  - Multimodal extensions (vision + language)
  - Long context techniques
  - Speculative decoding

Month 22-24: Research Contribution
  - Novel technique or finding
  - Paper writing + submission
  - Open-source contribution

Phase 5: Mastery (Month 24+)

- Lead model development at company or in open source
- Publish research papers
- Build novel architectures
- Start your own AI company or project
- Contribute to frontier model development

📚 Essential Resources

Books

Online Courses

Blogs & Communities

GitHub Repositories to Study