Comprehensive Compiler Design Learning Roadmap
Prerequisites & Foundations
Data Structures & Algorithms
- Trees (ASTs, parse trees, syntax trees)
- Graphs (control flow graphs, data flow graphs)
- Hash tables (symbol tables)
- Stack operations (parsing, runtime)
- String algorithms (pattern matching, lexical analysis)
Theory of Computation
- Regular expressions and finite automata
- Context-free grammars (CFGs)
- Pushdown automata
- Turing machines and computability
- Chomsky hierarchy
Programming Language Concepts
- Types and type systems
- Scope and binding
- Memory models
- Control flow structures
- Runtime environments
Frontend - Lexical & Syntax Analysis
Lexical Analysis (Scanning)
- Role of the lexer in compilation
- Token specification using regular expressions
- Finite automata (DFA/NFA)
- Converting regex to NFA to DFA
- DFA minimization algorithms
- Lexer generators and their operation
- Handling keywords, identifiers, literals
- Error detection and recovery in lexing
- Buffering techniques for efficient scanning
Syntax Analysis (Parsing)
- Context-free grammars fundamentals
- Derivations and parse trees
- Ambiguity in grammars
- Top-down parsing
- Recursive descent parsers
- LL(1) parsing with FIRST/FOLLOW sets
- Predictive parsing tables
- Bottom-up parsing
- LR(0) parsing
- SLR parsing
- LALR(1) parsing
- Canonical LR(1) parsing
- Handle pruning and shift-reduce conflicts
- Error recovery strategies
- Abstract Syntax Trees (AST) construction
- Parser generators (Yacc/Bison)
Semantic Analysis
Type Systems
- Static vs dynamic typing
- Type checking algorithms
- Type inference (Hindley-Milner)
- Polymorphism (parametric, ad-hoc, subtype)
- Type equivalence (structural vs nominal)
- Coercion and conversion rules
Symbol Tables
- Symbol table organization
- Scope management (nested scopes)
- Name resolution
- Forward references
- Storage allocation information
Semantic Actions
- Syntax-directed translation
- Attribute grammars (S-attributed, L-attributed)
- Semantic rule evaluation
- Declaration processing
- Constant folding in semantic phase
Intermediate Representations
IR Design
- Three-address code (TAC)
- Static Single Assignment (SSA) form
- Control Flow Graphs (CFG)
- Basic blocks and flow graphs
- Abstract syntax trees vs IR trees
- High-level vs low-level IR
IR Transformations
- Converting AST to IR
- CFG construction
- SSA construction (dominance frontiers)
- Converting out of SSA form
- IR normalization techniques
Code Optimization
Local Optimizations (Basic Block)
- Common subexpression elimination
- Copy propagation
- Constant folding and propagation
- Dead code elimination
- Algebraic simplifications
- Strength reduction
Global Optimizations (Control Flow)
- Reaching definitions analysis
- Available expressions analysis
- Live variable analysis
- Very busy expressions
- Dominance analysis
- Loop detection and natural loops
Dataflow Analysis Framework
- Forward vs backward analysis
- May vs must analysis
- Lattice theory basics
- Iterative dataflow algorithms
- Worklist algorithms
- Interprocedural dataflow
Advanced Optimizations
- Loop-invariant code motion
- Induction variable elimination
- Loop unrolling
- Loop fusion and fission
- Inlining
- Interprocedural constant propagation
- Escape analysis
- Pointer and alias analysis
- Partial redundancy elimination (PRE)
- Global value numbering (GVN)
- Sparse conditional constant propagation
SSA-Based Optimizations
- Sparse dataflow analysis
- SSA-based dead code elimination
- SSA-based constant propagation
- Path-sensitive optimizations
Code Generation
Instruction Selection
- Tree-pattern matching
- Dynamic programming approaches
- Maximal munch algorithm
- BURS (Bottom-Up Rewrite System)
- Instruction scheduling basics
Register Allocation
- Live range analysis
- Graph coloring algorithm
- Chaitin's algorithm
- Optimistic coloring
- Linear scan allocation
- Handling spills
- Pre-colored registers and constraints
- Coalescing and move elimination
Target Architecture Considerations
- Machine-specific instruction sets
- Calling conventions
- Register classes and constraints
- Addressing modes
- Pipeline considerations
Runtime Systems
Memory Management
- Stack allocation (activation records)
- Heap allocation strategies
- Mark and sweep
- Reference counting
- Copying collection
- Generational GC
- Incremental and concurrent GC
Linking and Loading
- Object file formats (ELF, COFF, Mach-O)
- Symbol resolution
- Relocation
- Dynamic linking
- Position-independent code (PIC)
Runtime Support
- Exception handling mechanisms
- Virtual method dispatch
- Type information at runtime (RTTI)
- Reflection support
Advanced Topics
Just-In-Time (JIT) Compilation
- Interpretation vs compilation trade-offs
- Tiered compilation strategies
- Profile-guided optimization
- On-stack replacement (OSR)
- Deoptimization techniques
Vectorization and Parallelization
- Auto-vectorization (SIMD)
- Loop vectorization
- Dependence analysis
- Parallelizing transformations
- Polyhedral optimization
Advanced Analysis
- Abstract interpretation
- Symbolic execution
- Program slicing
- Points-to analysis (Andersen's, Steensgaard's)
- Shape analysis
Domain-Specific Compilers
- DSL compiler techniques
- GPU compilers (CUDA, OpenCL)
- Quantum computing compilers
- SQL query compilation
- Regular expression compilation
Core Algorithms
Lexical Analysis
- Thompson's construction (regex to NFA)
- Subset construction (NFA to DFA)
- Hopcroft's DFA minimization
- Brzozowski's algorithm
Parsing
- Earley parser
- CYK (Cocke-Younger-Kasami) algorithm
- Packrat parsing (PEG)
- GLR parsing (generalized LR)
- Precedence climbing (operator precedence)
Type Checking & Inference
- Algorithm W (Hindley-Milner)
- Algorithm M (bidirectional type checking)
- Unification algorithm
- Subtyping algorithms
Optimization
- Kildall's dataflow framework
- Tarjan's algorithm (strongly connected components, dominators)
- Lengauer-Tarjan (fast dominators)
- Chaitin-Briggs register allocation
- Lazy code motion
- SCCP (Sparse Conditional Constant Propagation)
- Value numbering algorithms
Graph Algorithms
- Graph coloring (register allocation)
- Dominator tree construction
- Loop detection (finding back edges)
- Strongly connected components (SCC)
Key Techniques
Program Analysis
- Dataflow analysis (forward/backward, must/may)
- Control flow analysis
- Dependence analysis
- Points-to and alias analysis
- Escape analysis
- Abstract interpretation
- Symbolic execution
Transformation Techniques
- Loop transformations (unrolling, tiling, interchange)
- Outlining and inlining
- Tail call optimization
- Continuation-passing style (CPS)
- Closure conversion
- Lambda lifting
IR Techniques
- SSA construction and destruction
- Dominance frontiers
- Phi-node placement
- Critical edge splitting
- Basic block normalization
Essential Tools & Frameworks
Lexer/Parser Generators
- Flex/Lex (lexer)
- Bison/Yacc (parser)
- ANTLR (parser, multiple languages)
- Ragel (state machine compiler)
- Tree-sitter (incremental parsing)
Compiler Frameworks
- LLVM (modular compiler infrastructure)
- GCC (GNU Compiler Collection)
- JVM (Java Virtual Machine)
- V8 (JavaScript engine)
- WebAssembly toolchain
Analysis & Optimization Tools
- Souffle (Datalog engine for program analysis)
- Soot (Java optimization framework)
- WALA (program analysis for Java)
- Frama-C (formal verification)
Backend Tools
- LLVM TableGen (instruction selection)
- Peephole optimizers
- Assemblers (GNU as, NASM)
- Linkers (GNU ld, LLD)
Debugging & Testing
- Valgrind (memory debugging)
- LLDB/GDB (debuggers)
- AddressSanitizer, ThreadSanitizer
- Fuzzing tools (AFL, LibFuzzer)
Cutting-Edge Developments
Machine Learning for Compilers
- Learning heuristics for optimization passes
- Phase ordering using reinforcement learning
- Predictive modeling for compilation decisions
- AutoTVM and AutoScheduler for tensor compilers
- Neural architecture search for compiler optimizations
- Synthesizing compiler optimizations
- Superoptimization using ML
- Neural program synthesis
- Differentiable programming and compilers
Modern Language Features
- Dependent types (Idris, Agda influence)
- Linear types (Rust's ownership, session types)
- Refinement types
- Gradual typing systems
- Effect systems and algebraic effects
- Lock-free data structure compilation
- Memory model semantics (C++11, Java)
- Automatic parallelization improvements
- Async/await implementation strategies
- Actor model compilation
Hardware-Specific Compilation
- CPU-GPU co-optimization
- TPU and neural accelerator compilation
- FPGA synthesis from high-level languages
- Cross-device optimization
- Quantum circuit optimization
- Quantum IR design (QIR, OpenQASM)
- Quantum-classical hybrid compilation
- Error correction aware compilation
Security & Verification
- Verified compiler implementations (CompCert)
- Formally verified optimizations
- Proof-carrying code
- Certified compilation
- Control-flow integrity (CFI)
- Address space layout randomization (ASLR)
- Spectre/Meltdown mitigations
- Constant-time code generation
- Side-channel resistant compilation
Domain-Specific Advances
- TVM, Halide for image processing
- XLA (Accelerated Linear Algebra)
- MLIR (Multi-Level IR) ecosystem
- Triton for GPU programming
- Polyhedral compilation for deep learning
- WASM optimization techniques
- Ahead-of-time WASM compilation
- WASI (WebAssembly System Interface)
- Embedded and IoT compilation
- IDE-friendly compilation
- Incremental type checking
- Query-based compilation
- Language servers and compiler APIs
Energy & Performance
- Power consumption optimization
- Thermal-aware code generation
- Energy profiling integration
- Quality-configurable compilation
- Trading accuracy for performance/energy
- Probabilistic program analysis
Project Ideas
Beginner Projects
1. Regular Expression Engine
- Build a regex matcher using NFA/DFA
- Implement Thompson's construction
- Add capturing groups support
Skills: automata theory, basic algorithms
2. Calculator Language Compiler
- Lexer for arithmetic expressions
- Recursive descent parser
- AST construction and evaluation
- Support variables and functions
Skills: parsing fundamentals, AST manipulation
3. Simple Expression Evaluator
- Infix to postfix conversion
- Stack-based evaluation
- Add support for variables
Skills: parsing, symbol tables
4. Subset-of-C Lexer
- Tokenize a minimal C subset
- Handle keywords, identifiers, operators
- Implement proper error reporting
Skills: lexical analysis, state machines
5. Lisp/Scheme Interpreter
- S-expression parser
- Environment-based evaluation
- Implement closures and first-class functions
Skills: functional programming, interpretation basics
Intermediate Projects
6. Static Type Checker
- Type inference for a simple language
- Implement Hindley-Milner algorithm
- Support polymorphism
- Generate meaningful error messages
Skills: type systems, unification
7. Basic C Compiler (to Assembly)
- Parse C subset (functions, control flow, arrays)
- Generate x86-64 or ARM assembly
- Implement calling conventions
- Basic optimizations (constant folding)
Skills: full compilation pipeline, code generation
8. Stack-Based VM and Compiler
- Design bytecode instruction set
- Compile expressions to bytecode
- Implement VM with stack and locals
- Add garbage collection
Skills: VM design, bytecode generation
9. SSA Form Converter
- Build CFG from three-address code
- Compute dominance frontiers
- Insert phi nodes correctly
- Implement optimization on SSA
Skills: IR design, graph algorithms
10. LLVM Frontend for Custom Language
- Use LLVM C++ API
- Generate LLVM IR
- Leverage LLVM's optimization passes
- Link with runtime library
Skills: LLVM ecosystem, IR generation
11. Basic Dataflow Analyzer
- Implement reaching definitions
- Live variable analysis
- Available expressions
- Use results for dead code elimination
Skills: dataflow analysis, optimization
Advanced Projects
12. Optimizing Compiler with Multiple Passes
- Implement 5-10 optimization passes
- SSA-based optimizations
- Register allocation with spilling
- Instruction scheduling
- Benchmark against GCC/Clang
Skills: advanced optimization, performance tuning
13. JIT Compiler
- Runtime code generation
- Tiered compilation (interpreter to JIT)
- Profile-guided optimization
- Dynamic deoptimization
Skills: runtime compilation, profiling
14. Garbage Collector Implementation
- Mark-and-sweep collector
- Generational GC
- Concurrent collection
- Integration with compiler
Skills: memory management, concurrency
15. Polyhedral Optimizer
- Dependence analysis for loops
- Affine transformation framework
- Tiling and parallelization
- Use ISL (Integer Set Library)
Skills: mathematical optimization, advanced loops
16. GPU/SIMD Auto-Vectorizer
- Loop dependence analysis
- Vectorization profitability model
- Generate SIMD instructions (SSE/AVX/NEON)
- Or CUDA/OpenCL kernels
Skills: parallelism, hardware architecture
17. Interprocedural Optimizer
- Call graph construction
- Function inlining heuristics
- Interprocedural constant propagation
- Whole-program optimization
Skills: program-wide analysis
18. Verified Compiler Component
- Use Coq or Isabelle/HOL
- Formalize a compiler pass
- Prove correctness properties
- Extract executable code
Skills: formal verification, theorem proving
19. Tensor/ML Compiler
- Operator fusion
- Memory layout optimization
- Auto-tuning for target hardware
- Integration with ML frameworks
Skills: domain-specific compilation, performance
20. Language with Advanced Type System
- Implement dependent types or linear types
- Type-level computation
- Elaboration and proof search
- Error messages with suggestions
Skills: advanced type theory, PL design
21. Quantum Circuit Compiler
- Quantum gate optimization
- Qubit allocation
- Transpilation to target architectures
- Integration with Qiskit or Cirq
Skills: quantum computing, specialized optimization
22. Compiler with MLIR
- Multi-level IR transformations
- Progressive lowering through dialects
- Custom dialect for domain
- Integration with LLVM backend
Skills: modern compiler infrastructure
Recommended Resources
Books
- "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
- "Modern Compiler Implementation in ML/C/Java" - Appel
- "Engineering a Compiler" - Cooper & Torczon
- "Advanced Compiler Design and Implementation" - Muchnick
- "SSA-based Compiler Design" - Rastello & Bouchez
- "Types and Programming Languages" - Pierce
Online Courses
- Stanford CS143: Compilers
- Cornell CS6120: Advanced Compilers
- UCSD's Compilers course
- MIT's Computer Language Engineering
Documentation & Tutorials
- LLVM documentation and tutorials
- GCC internals documentation
- "Let's Build a Compiler" by Jack Crenshaw
- LLVM Kaleidoscope tutorial
Research
- PLDI, POPL, OOPSLA conference proceedings
- CGO (Code Generation and Optimization)
- TOPLAS journal