0. Prerequisites & Foundation

0.1 Mathematical Foundations

Linear Algebra

  • Matrix operations, eigenvalues/eigenvectors
  • Vector spaces, transformations
  • Singular Value Decomposition (SVD)

Calculus & Optimization

  • Multivariable calculus, gradients
  • Chain rule, backpropagation mathematics
  • Gradient descent variants (SGD, Adam, AdamW)
  • Learning rate scheduling
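
A minimal PyTorch sketch tying these pieces together: AdamW with a cosine-annealed learning rate (the model, batch, and hyperparameters here are placeholders, not a recommended recipe):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                           # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x, target = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # decay the LR along a cosine curve
```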

Probability & Statistics

  • Probability distributions (Gaussian, Bernoulli)
  • Bayesian inference
  • Maximum Likelihood Estimation (MLE)
  • KL divergence, cross-entropy

Information Theory

  • Entropy, mutual information
  • Connections to cross-entropy and KL divergence

0.2 Deep Learning Fundamentals

Neural Network Basics

  • Perceptrons, activation functions
  • Feedforward networks (MLP)
  • Backpropagation algorithm
  • Loss functions (MSE, Cross-Entropy, L1/L2)

Convolutional Neural Networks (CNNs)

  • Convolution operations, pooling
  • Receptive fields
  • Batch normalization, layer normalization
  • ResNet, DenseNet architectures
  • U-Net architecture (critical for diffusion)

Recurrent Neural Networks (RNNs)

  • LSTM, GRU architectures
  • Sequence modeling
  • Attention mechanisms

Transformers

  • Self-attention mechanism
  • Multi-head attention
  • Positional encoding (sinusoidal, learned)
  • Layer normalization
  • Vision Transformers (ViT)
  • Encoder-decoder architecture
  • BERT, GPT architectures

0.3 Computer Vision Fundamentals

Image Processing

  • Color spaces (RGB, HSV, LAB)
  • Image filtering, edge detection
  • Image augmentation techniques

Feature Extraction

  • SIFT, SURF, ORB
  • CNN feature maps
  • Semantic segmentation

3D Vision Basics

  • Camera geometry, intrinsic/extrinsic parameters
  • Structure from Motion (SfM)
  • Multi-view geometry
  • Point clouds, meshes, voxels

0.4 Programming & Tools

Python Ecosystem

  • NumPy, SciPy for numerical computing
  • OpenCV for image processing
  • Matplotlib, Plotly for visualization

Deep Learning Frameworks

  • PyTorch (recommended for research)
  • TensorFlow/Keras
  • JAX for high-performance computing

GPU Computing

  • CUDA basics
  • Memory management
  • Mixed precision training (FP16, BF16)

1. Text-to-Image Generation

Foundation Models Overview

Text-to-image generation has evolved through three major paradigms: GANs, VAEs, and Diffusion Models.

1.1 Foundation Models

Generative Adversarial Networks (GANs)

  • Core Concepts: Generator-Discriminator architecture, Adversarial loss
  • Key Architectures: DCGAN, StyleGAN, Progressive GAN, BigGAN
  • Text-to-Image GANs: StackGAN, AttnGAN, ControlGAN, XMC-GAN
  • Limitations: Training instability, Mode collapse, Limited diversity

Variational Autoencoders (VAEs)

  • Core Concepts: Encoder-decoder architecture, Latent space representation
  • Key Concepts: Reparameterization trick, Evidence Lower Bound (ELBO)
  • Regularization: KL divergence regularization
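
A minimal sketch of the reparameterization trick and the closed-form KL term for a diagonal Gaussian posterior, assuming an encoder that outputs a mean and log-variance:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Differentiable sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()

# Training minimizes: reconstruction loss + beta * kl_to_standard_normal(mu, logvar),
# which is the negative ELBO (up to the beta weighting).
```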

Diffusion Models (State-of-the-Art)

  • Core Principles: Forward/reverse diffusion processes
  • Mathematical Framework: DDPM, Score-based models, SDEs
  • Key Architectures: U-Net Based Diffusion, DDPM, DDIM, LDM
  • Diffusion Transformers (DiT): Vision Transformer backbone

1.2 Text Encoding for Image Generation

1.2.1 Text Encoders

  • CLIP (Contrastive Language-Image Pre-training): Vision-language alignment, Used in Stable Diffusion, DALL-E 2
  • T5 (Text-to-Text Transfer Transformer): Encoder-decoder architecture, Better semantic understanding
  • BERT-based Encoders: Bidirectional context, Used in early models

1.2.2 Cross-Attention Mechanisms

  • Text-to-image cross-attention
  • Key-value-query formulation
  • Multi-scale attention
  • Spatial attention maps

1.3 Conditioning Techniques

  • Classifier-Free Guidance (CFG): Conditional vs unconditional training (see the sketch after this list)
  • Classifier Guidance: External classifier gradients
  • Cross-Attention Conditioning: Text embeddings as keys/values
  • Adaptive Instance Normalization (AdaIN): Style transfer concepts
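
As referenced above, classifier-free guidance at inference time combines a conditional and an unconditional noise prediction. A minimal sketch; `denoiser`, the embeddings, and the guidance scale are placeholders for whichever model and text encoder are in use:

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    eps_uncond = denoiser(x_t, t, uncond_emb)     # prediction for the empty/null prompt
    eps_cond = denoiser(x_t, t, cond_emb)         # prediction for the actual prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger guidance scales trade diversity for prompt adherence; a scale of 1.0 recovers the purely conditional prediction.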

1.4 Advanced Techniques

ControlNet

Spatial conditioning (edges, depth, pose), Trainable copy of U-Net blocks
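
A schematic of the ControlNet idea (not the full implementation): pair each frozen block with a trainable copy whose output re-enters through a zero-initialized convolution, so training starts as a no-op. The block and channel count below are placeholders:

```python
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """Frozen U-Net block + trainable copy, merged via a zero-initialized 1x1 conv."""
    def __init__(self, unet_block: nn.Module, channels: int):
        super().__init__()
        self.control = copy.deepcopy(unet_block)      # trainable copy of the pretrained block
        self.frozen = unet_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)                   # original weights stay locked
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)         # zero init: the copy has no effect at step 0
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, condition):
        # The spatial condition (e.g., an encoded edge or depth map) feeds only the trainable branch.
        return self.frozen(x) + self.zero_conv(self.control(x + condition))
```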

IP-Adapter

Image prompt adapter, Cross-attention image conditioning

LoRA (Low-Rank Adaptation)

Efficient fine-tuning, Low-rank weight matrices, Reduced training costs
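
The core LoRA idea in a few lines: freeze the pretrained weight and learn a low-rank update added to its output (the rank and scaling below are illustrative defaults, not a prescription):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus trainable low-rank update: W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only the two small matrices are trained and saved, which is why LoRA checkpoints are megabytes rather than gigabytes.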

Textual Inversion

Custom concept learning, Embedding optimization, Few-shot personalization

DreamBooth

Subject-driven generation, Fine-tuning with regularization, Identity preservation

1.5 State-of-the-Art Models (2024-2025)

FLUX.1 (Black Forest Labs)

Cutting-edge quality, 12B parameter model, Flow-matching architecture

Stable Diffusion 3 & SD3.5

Diffusion Transformer (DiT) architecture, Improved text rendering

DALL-E 3 (OpenAI)

Enhanced prompt following, GPT-4 caption rewriting, Safety improvements

Imagen 3 (Google)

Photorealistic outputs, Text rendering capabilities, Large-scale training

HunyuanImage (Tencent)

Multimodal LLM backbone, Long prompt support (1000+ words), Bilingual capabilities

Qwen-Image (Alibaba)

Excellent text rendering, Multilingual support, Unified generation/editing

2. Text-to-Video Generation

2.1 Video Generation Challenges

Temporal Consistency

  • Frame-to-frame coherence
  • Motion smoothness
  • Object persistence

Computational Complexity

  • High memory requirements
  • Long training times
  • Inference costs

Motion Understanding

  • Physics simulation
  • Natural movement

2.2 Core Architectures

2.2.1 Video Diffusion Models

  • 3D U-Net Architecture: Spatial convolutions (2D) plus temporal convolutions (1D); see the sketch after this list
  • Key Concepts: Per-frame noise addition, Temporal attention layers
  • Models: Imagen Video (Google), Make-A-Video (Meta), Video Diffusion Models (VDM)
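
As referenced in the 3D U-Net item above, a common building block factorizes a full 3D convolution into a 2D spatial convolution followed by a 1D temporal one. A minimal sketch over (batch, channels, frames, height, width) tensors:

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Spatial (1x3x3) conv followed by a temporal (3x1x1) conv, both expressed as Conv3d."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))

video = torch.randn(2, 64, 8, 32, 32)               # (batch, channels, frames, height, width)
out = FactorizedConv3d(64)(video)                    # output keeps the same shape
```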

2.2.2 Diffusion Transformers for Video

  • Architecture Components: 3D patch embedding, Spatio-temporal attention
  • State-of-the-Art Models: Sora (OpenAI), HunyuanVideo (Tencent), Veo 3 (Google), Gen-3/4 (Runway)
  • Sora Features: Long-duration videos (up to ~60 seconds), High resolution (1080p), Physics understanding

2.2.3 Autoregressive Video Models

  • Frame-by-Frame Generation: Conditional on previous frames, Recurrent architectures
  • Models: VideoGPT, NUWA, Phenaki

2.2.4 Latent Video Diffusion

  • 3D VAE Compression: Spatial compression (16x16), Temporal compression (4x)
  • Benefits: Reduced memory, Faster training, Scalability

2.3 Temporal Modeling Techniques

  • Temporal Attention: Self-attention across frames, Causal masking (see the sketch after this list)
  • Motion Representations: Optical flow, Motion vectors, Velocity fields
  • Frame Interpolation: In-betweening, Temporal super-resolution
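
As referenced in the temporal attention item above, one common pattern folds the spatial dimensions into the batch and attends only across frames. A minimal sketch (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across the frame axis of a (B, T, C, H, W) video tensor."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Treat every spatial location as an independent sequence of T frame tokens.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

x = torch.randn(1, 8, 64, 16, 16)                    # 8 frames of a 64-channel 16x16 latent
y = TemporalAttention(64)(x)                         # same shape, frames now exchange information
```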

2.4 Video-Specific Training Strategies

  • Data Augmentation: Temporal cropping, Frame rate variation
  • Multi-Stage Training: Image pre-training, Video fine-tuning
  • Token Segmentation: Overlapping segments, Temporal coherence maintenance

2.5 Advanced Features (2025)

Audio-Video Synchronization

Joint audio-visual generation, Lip-sync, Sound effect generation

Camera Control

Camera motion specification, Focal length control, Cinematic effects

Multi-Shot Generation

Scene transitions, Shot composition, Narrative coherence

Physics Simulation

Realistic motion, Object interactions, Gravity, momentum

3. Text-to-3D Generation

3.1 3D Representations

Explicit Representations

  • Meshes (vertices, faces)
  • Point clouds
  • Voxels
  • Multi-view images

Implicit Representations

  • Neural Radiance Fields (NeRF)
  • Signed Distance Functions (SDF)
  • Occupancy fields

Hybrid Representations

  • 3D Gaussian Splatting
  • Neural surface representations

3.2 Neural Radiance Fields (NeRF)

3.2.1 Core NeRF Concepts

  • Architecture: MLP, 5D input: (x, y, z, θ, φ), Output: (RGB, density)
  • Volume Rendering: Ray marching, Alpha compositing, Differentiable rendering
  • Training Process: Multi-view supervision, Photometric loss
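
The volume rendering step can be sketched in a few lines: alpha-composite the per-sample colors along each ray using the predicted densities. The MLP outputs and sample spacings are assumed given:

```python
import torch

def composite_ray(rgb: torch.Tensor, sigma: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Differentiable volume rendering for a batch of rays.
    rgb:    (rays, samples, 3) colors from the NeRF MLP
    sigma:  (rays, samples)    densities from the MLP
    deltas: (rays, samples)    distances between adjacent samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # per-segment opacity
    # Transmittance: probability the ray reaches sample i without being absorbed earlier.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]
    weights = alpha * trans                                        # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                # (rays, 3) pixel colors
```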

3.2.2 NeRF Variants

Instant-NGP (NVIDIA)

Hash encoding, Multi-resolution grid, 1000x faster training

Mip-NeRF

Anti-aliasing, Multi-scale representation, Cone tracing

NeRF-W (NeRF in the Wild)

Appearance embedding, Transient objects, Lighting variations

TensoRF

Tensor decomposition, Faster rendering, Lower memory

3.2.3 NeRF for Text-to-3D

DreamFusion (Google)

Score Distillation Sampling (SDS), 2D diffusion model as prior
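
A schematic of the SDS update, not DreamFusion's exact code: noise the current rendering, ask a frozen 2D diffusion model to predict that noise, and inject the difference as a gradient on the rendering. The `noise_pred_fn`, schedule tensor, and weighting are placeholders:

```python
import torch

def sds_loss(rendered, noise_pred_fn, text_emb, alphas_cumprod, weight=1.0):
    """Score Distillation Sampling: d(loss)/d(rendered) equals weight * (eps_hat - eps)."""
    t = torch.randint(20, 980, (rendered.shape[0],), device=rendered.device)
    eps = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * rendered + (1 - a_t).sqrt() * eps           # forward-diffuse the rendering
    eps_hat = noise_pred_fn(x_t, t, text_emb)                      # frozen 2D diffusion prior
    grad = weight * (eps_hat - eps)
    # Detaching the gradient makes backprop deposit exactly `grad` onto `rendered`,
    # which then flows into the NeRF / Gaussian parameters through the renderer.
    return (grad.detach() * rendered).sum()
```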

Magic3D

Two-stage generation, Coarse NeRF + fine mesh, Faster than DreamFusion

ProlificDreamer

Variational Score Distillation (VSD), Higher quality, Better diversity

3.3 3D Gaussian Splatting

  • Core Concepts: 3D Gaussians as primitives, Explicit representation
  • Advantages over NeRF: Faster rendering (real-time), Better quality, Explicit editing
  • Text-to-3D with Gaussians: DreamGaussian, GaussianDreamer, LucidDreamer

3.4 Direct 3D Generation

3.4.1 Multi-View Diffusion

  • Process: Generate multiple views, 3D reconstruction, Consistency enforcement
  • Models: MVDream, SyncDreamer, Zero123 (view-conditioned)

3.4.2 Native 3D Diffusion

  • Voxel Diffusion: 3D U-Net, Memory intensive
  • Point Cloud Diffusion: Point-E (OpenAI), Shap-E, Efficient but lower quality
  • Triplane Representations: 3D features on 2D planes, EG3D architecture, Memory efficient

3.5 Text-to-3D Pipeline

Stage 1: Multi-View Generation

Text → Multiple 2D images, Consistent viewpoints, Diffusion-based

Stage 2: 3D Reconstruction

NeRF optimization, Gaussian Splatting, Mesh extraction

Stage 3: Refinement

Texture enhancement, Geometry optimization, PBR materials

3.6 State-of-the-Art Models (2024-2025)

Meta 3D Gen

Fast pipeline (<1 minute), High-quality meshes, PBR texture support

Rodin Gen-2 (Hyper3D)

10B parameters, BANG architecture, Production-ready assets

Meshy AI

Text-to-3D, Image-to-3D, AI texture generation, Multi-format export

Tripo AI (TripoSR)

Fast reconstruction, Unity/Unreal integration, API-friendly

3.7 Advanced 3D Techniques

Texture Generation

PBR materials (albedo, roughness, metallic, normal), Texture painting

Rigging & Animation

Automatic rigging, Skeleton generation, Motion retargeting

Mesh Optimization

Polygon reduction, Topology optimization, LOD generation

4. Core Algorithms & Techniques

4.1 Diffusion Model Algorithms

Forward Process (Noising)

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) · x_{t-1}, β_t · I)

Reverse Process (Denoising)

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

Training Objective (Noise Prediction)

L = E[||ε - ε_θ(x_t, t)||^2]
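
In code, this objective is a single MSE between the sampled noise and the model's prediction on the noised input. A minimal sketch; the model and schedule are placeholders:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod):
    """Noise a clean batch x0 at a random timestep and regress the added noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps   # closed-form forward process q(x_t | x_0)
    eps_pred = model(x_t, t)                          # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)
```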

Sampling Methods

  • DDPM: Full Markov chain
  • DDIM: Deterministic, faster
  • DPM-Solver: ODE solver, efficient
  • Euler, Heun methods
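
For illustration, a single deterministic DDIM update (eta = 0), assuming a noise-prediction model and the training schedule's cumulative alphas:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t: int, t_prev: int, alphas_cumprod):
    """One deterministic DDIM update from timestep t to t_prev."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()          # predicted clean sample
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```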

4.2 Attention Mechanisms

Self-Attention

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

Cross-Attention

Q = Linear(image_features)
K, V = Linear(text_embeddings)

Multi-Head Attention

  • Parallel attention heads
  • Concatenation and projection
  • Different representation subspaces
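
Putting these pieces together, a minimal multi-head cross-attention module in which queries come from image tokens and keys/values from text embeddings (the dimensions are illustrative, loosely following Stable Diffusion's conventions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """softmax(Q K^T / sqrt(d_k)) V with Q from image tokens, K and V from text tokens."""
    def __init__(self, dim: int = 320, text_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        b = image_tokens.shape[0]
        split = lambda x: x.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.to_q(image_tokens)), split(self.to_k(text_tokens)), split(self.to_v(text_tokens))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, self.num_heads * self.head_dim)
        return self.to_out(out)
```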

4.3 Loss Functions

Reconstruction Losses

  • MSE (L2 loss)
  • L1 loss
  • Perceptual loss (VGG features)

Adversarial Losses

  • Binary cross-entropy
  • Wasserstein loss
  • Hinge loss

Regularization

  • KL divergence (VAE)
  • Total variation
  • Sparsity constraints

Semantic Losses

  • CLIP similarity
  • LPIPS (Learned Perceptual)

4.4 Optimization Techniques

Optimizers

  • Adam, AdamW
  • Lion optimizer
  • Muon optimizer (for video)

Learning Rate Schedules

  • Warmup
  • Cosine annealing
  • Step decay
  • Exponential decay

Gradient Techniques

  • Gradient clipping
  • Gradient accumulation
  • Mixed precision training (AMP)
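
A sketch of how these techniques combine in a training loop; the model, data loader, and accumulation factor are placeholders:

```python
import torch

def train_with_amp_and_accumulation(model, loader, optimizer, accum_steps: int = 4):
    """Mixed precision, gradient accumulation, and gradient clipping in one loop."""
    scaler = torch.cuda.amp.GradScaler()
    for step, (x, y) in enumerate(loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x, y) / accum_steps          # scale so accumulated grads match a full batch
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)                # unscale before clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```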

4.5 Sampling & Inference

Guidance Techniques

  • Classifier guidance
  • Classifier-free guidance (CFG)
  • Negative prompts

Sampling Schedulers

  • Linear noise schedule
  • Cosine schedule
  • Karras schedule
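
For example, the cosine schedule listed above can be written directly in terms of the cumulative alphas, following the improved-DDPM formulation (the small offset s = 0.008 is the commonly used default):

```python
import torch

def cosine_alphas_cumprod(num_steps: int, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule: alpha-bar_t for t = 1..num_steps."""
    t = torch.linspace(0, num_steps, num_steps + 1) / num_steps
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    alphas_cumprod = f / f[0]                         # normalize so alpha-bar at t=0 is 1
    return alphas_cumprod[1:].clamp(1e-5, 1.0)
```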

Quality vs Speed Tradeoffs

  • Step count (20-50 steps typical)
  • Resolution
  • Batch size

5. Cutting-Edge Developments (2024-2025)

5.1 Architecture Innovations

Diffusion Transformers (DiT)

Replacing U-Net with transformers, Better scalability, Used in Sora, SD3, FLUX

Flow Matching

Alternative to diffusion, Continuous normalizing flows, Used in SD3, FLUX.1

Rectified Flow

Straight-line trajectories, Faster sampling, Better quality
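
A minimal sketch of the rectified-flow / flow-matching training objective with straight-line interpolation paths; the velocity model and sign conventions are placeholders for whichever formulation a given model uses:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_model, x0):
    """Regress the constant velocity (x1 - x0) along the line x_t = (1 - t) * x0 + t * x1."""
    x1 = torch.randn_like(x0)                          # noise endpoint of the path
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                        # point on the straight-line trajectory
    v_pred = velocity_model(x_t, t.flatten())          # model predicts the velocity field
    return F.mse_loss(v_pred, x1 - x0)
```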

Sparse Diffusion Transformers

Mixture-of-Experts (MoE), Efficient computation, HiDream-I1

Consistency Models

Single-step generation, Distillation from diffusion, Real-time generation

5.2 Efficiency Improvements

Latent Compression

Higher compression ratios (16x, 32x), Improved VAEs, Quality preservation

Knowledge Distillation

Student-teacher training, Fewer sampling steps, Maintained quality

Quantization

INT8, FP8 inference, Reduced memory, Hardware acceleration

Flash Attention

Memory-efficient exact attention, Linear memory in sequence length, Faster training and inference

5.3 Enhanced Capabilities

Long-Context Understanding

1000+ word prompts, Detailed scene descriptions, Narrative generation

Multi-Modal Conditioning

Text + image, Text + audio, Text + 3D

Editable Representations

Inpainting/outpainting, Object removal/addition, Style transfer

High-Resolution Generation

4K, 8K images, Tiled generation, Multi-scale training

5.4 Novel Applications

4D Generation (3D + Time)

Dynamic 3D scenes, Deformable objects, Physics-aware

Neural Codec Models

Learned compression, Tokenization for generation

World Models

Interactive 3D environments, Physics simulation, Agent navigation

5.5 Safety & Alignment

Watermarking

Invisible markers, Provenance tracking, Copyright protection

Safety Filters

Content moderation, Bias mitigation, Harmful content detection

Alignment Techniques

RLHF for generation, Preference learning, Red-teaming

6. Implementation Roadmap

Phase 1: Foundation (Months 1-3)

Month 1: Deep Learning Basics

  • Master PyTorch fundamentals
  • Implement basic CNNs
  • Train image classifiers
  • Understand backpropagation

Month 2: Advanced Architectures

  • Implement U-Net from scratch
  • Build Vision Transformer
  • Study attention mechanisms
  • Implement ResNet, DenseNet

Month 3: Generative Models Intro

  • Implement VAE
  • Build simple GAN
  • Understand latent spaces
  • Train on simple datasets (MNIST, CIFAR)

Phase 2: Text-to-Image (Months 4-8)

Month 4: Diffusion Fundamentals

  • Implement DDPM from scratch
  • Understand noise schedules
  • Train simple diffusion model
  • Study score-based models

Month 5: Latent Diffusion

  • Train/fine-tune VAE
  • Implement latent diffusion
  • Integrate text encoder (CLIP)
  • Cross-attention conditioning

Month 6: Stable Diffusion Deep Dive

  • Study Stable Diffusion architecture
  • Fine-tune on custom data
  • Implement ControlNet
  • Add LoRA training

Month 7: Advanced Conditioning

  • Classifier-free guidance
  • Multi-scale generation
  • Implement DreamBooth
  • Textual inversion

Month 8: Optimization & Scaling

  • Mixed precision training
  • Multi-GPU training
  • Gradient checkpointing
  • Inference optimization

Phase 3: Text-to-Video (Months 9-12)

Month 9: Video Understanding

  • Study 3D convolutions
  • Temporal attention
  • Video datasets (WebVid-10M, HD-VILA-100M)
  • Frame interpolation

Month 10: Video Diffusion Basics

  • Implement 3D U-Net
  • Spatio-temporal layers
  • Train on short clips
  • Handle memory constraints

Month 11: Advanced Video Generation

  • Study Sora architecture
  • Implement DiT for video
  • Multi-stage generation
  • Camera control

Month 12: Video Refinement

  • Temporal consistency
  • Motion smoothness
  • Audio integration
  • Quality evaluation

Phase 4: Text-to-3D (Months 13-16)

Month 13: 3D Fundamentals

  • Study NeRF architecture
  • Implement basic NeRF
  • Volume rendering
  • Camera pose optimization

Month 14: Fast 3D Reconstruction

  • Implement Instant-NGP
  • 3D Gaussian Splatting
  • Real-time rendering
  • Mesh extraction

Month 15: Text-to-3D Pipeline

  • Score Distillation Sampling
  • Multi-view generation
  • 3D consistency
  • Texture synthesis

Month 16: Production Pipeline

  • PBR material generation
  • Mesh optimization
  • Export formats
  • Integration with engines

Phase 5: Advanced Topics (Months 17-20)

Month 17: State-of-the-Art Models

  • Study FLUX.1, SD3
  • Flow matching
  • Rectified flow
  • Implementation details

Month 18: Efficiency

  • Model compression
  • Quantization
  • Distillation
  • Hardware optimization

Month 19: Research & Innovation

  • Read latest papers
  • Implement novel techniques
  • Experiment with architectures
  • Contribute to open source

Month 20: Deployment & Scaling

  • Build APIs
  • Cloud deployment
  • Monitoring & logging
  • User interface

7. Tools, Frameworks & Resources

7.1 Core Frameworks

PyTorch Ecosystem

  • PyTorch: Core framework
  • torchvision: Image transforms
  • torchaudio: Audio processing
  • PyTorch Lightning: Training framework
  • Accelerate: Multi-GPU training

Hugging Face

  • Diffusers: Diffusion models library
  • Transformers: Text encoders
  • Datasets: Dataset loading
  • Hub: Model sharing
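
For orientation, a typical Diffusers usage pattern looks like the following; the checkpoint id is illustrative and any compatible Stable Diffusion checkpoint from the Hub works:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",               # example checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```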

Specialized Libraries

  • CompVis/Stable-Diffusion: Original SD
  • AUTOMATIC1111: Stable Diffusion WebUI
  • ComfyUI: Node-based generation
  • InvokeAI: Professional interface

7.2 3D Tools

NeRF Libraries

  • Nerfstudio: NeRF framework
  • Instant-NGP: Fast NeRF
  • threestudio: Text-to-3D
  • MVDream: Multi-view generation

3D Software Integration

  • Blender: 3D modeling
  • Unity: Game engine
  • Unreal Engine: Real-time rendering
  • Three.js: Web 3D

7.3 Datasets

Image-Text Pairs

  • LAION-5B: 5 billion pairs
  • LAION-400M: 400M CLIP-filtered pairs
  • COYO-700M: Alt-text captions
  • DataComp: Curated datasets

Images

  • ImageNet: Classification
  • COCO: Object detection, captions
  • OpenImages: Large-scale
  • Conceptual Captions: 3M+ pairs

Video Datasets

  • WebVid-10M: Text-video pairs
  • HD-VILA-100M: High-quality video
  • InternVid: 7M videos
  • Panda-70M: Diverse video

3D Datasets

  • Objaverse: 800K+ 3D models
  • ShapeNet: CAD models
  • ModelNet: Classification
  • ScanNet: RGB-D scans

7.4 Evaluation Metrics

Image Quality

  • FID (Fréchet Inception Distance)
  • IS (Inception Score)
  • CLIP Score: Text-image alignment
  • LPIPS: Perceptual similarity
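
As a small example of the CLIP Score idea, cosine similarity between CLIP's image and text embeddings can be computed with the Hugging Face transformers CLIP classes; the model id, prompt, and image path are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")                   # example path to a generated image
inputs = processor(text=["a red car on a beach"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_score = (img_emb * txt_emb).sum(dim=-1)          # cosine similarity; higher = better alignment
```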

Video Quality

  • Temporal consistency metrics
  • Optical flow analysis
  • User studies

3D Quality

  • Chamfer Distance
  • Intersection over Union (IoU)
  • Mesh quality metrics

7.5 Hardware Requirements

Training

  • GPU: NVIDIA A100 (40GB/80GB) recommended
  • Alternative: H100, V100, RTX 4090
  • Memory: 64GB+ system RAM
  • Storage: NVMe SSD (fast data loading)

Inference

  • Minimum: RTX 3060 (12GB)
  • Recommended: RTX 4080/4090
  • Cloud: AWS p3/p4, Google TPU

7.6 Learning Resources

Courses

  • Stanford CS231n: Deep Learning for CV
  • Fast.ai: Practical Deep Learning
  • Hugging Face Diffusion Course
  • 3D Vision Course (TUM)

Papers (Must-Read)

  • "Attention Is All You Need" (Transformers)
  • "Denoising Diffusion Probabilistic Models"
  • "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)
  • "Scalable Diffusion Models with Transformers"
  • "NeRF: Representing Scenes as Neural Radiance Fields"
  • "DreamFusion: Text-to-3D using 2D Diffusion"

Communities

  • r/StableDiffusion
  • Hugging Face Discord
  • Papers with Code
  • GitHub repositories

Blogs & Tutorials

  • Lilian Weng's Blog
  • Hugging Face Blog
  • Towards Data Science
  • Machine Learning Mastery

7.7 Cloud Platforms

Training Platforms

  • Google Colab: Free GPU
  • Kaggle: Free TPU
  • Lambda Labs: GPU rental
  • RunPod: Affordable GPUs

Deployment

  • Replicate: Model hosting
  • Hugging Face Spaces
  • AWS SageMaker
  • Modal Labs

7.8 Development Tools

Version Control & Experiment Tracking

  • Git, GitHub
  • DVC: Data version control
  • Weights & Biases: Experiment tracking
  • TensorBoard: Visualization

Profiling

  • PyTorch Profiler
  • NVIDIA Nsight
  • Memory profilers

8. Best Practices & Tips

8.1 Training Best Practices

Data Quality

  • Curate high-quality datasets
  • Filter low-resolution images
  • Balance dataset distribution
  • Use proper augmentation

Hyperparameter Tuning

  • Start with known configs
  • Use learning rate warmup
  • Monitor training curves
  • Validate regularly

Memory Management

  • Gradient accumulation
  • Gradient checkpointing
  • Mixed precision (FP16/BF16)
  • Batch size optimization

8.2 Evaluation

Qualitative Metrics

  • Visualization
  • Failure case analysis
  • Prompt testing
  • Edge case evaluation

8.3 Debugging Strategies

Common Issues

  • Mode collapse (GANs)
  • Training instability
  • Memory errors
  • Slow convergence

Solutions

  • Learning rate adjustment
  • Architecture modifications
  • Data augmentation
  • Regularization

8.4 Ethical Considerations

Bias & Fairness

  • Dataset bias awareness
  • Diverse representation
  • Bias mitigation techniques

Safety

  • Content filtering
  • Watermarking
  • Usage policies
  • Red-teaming

Copyright

  • Dataset licensing
  • Fair use considerations
  • Attribution

9. Project Milestones

Beginner Projects

1. Train simple VAE on MNIST: Implement and train a basic Variational Autoencoder on the MNIST dataset.
2. Implement basic GAN: Build a simple Generative Adversarial Network for image generation.
3. Fine-tune Stable Diffusion: Fine-tune a pre-trained Stable Diffusion model on a custom dataset.
4. Build LoRA for custom style: Create a Low-Rank Adaptation for a specific artistic style.

Intermediate Projects

1. Train text-to-image diffusion model: Train a diffusion model from scratch for text-to-image generation.
2. Implement ControlNet conditioning: Add spatial conditioning capabilities using ControlNet.
3. Build video frame interpolation: Create a model for generating intermediate video frames.
4. Create NeRF from images: Generate a 3D Neural Radiance Field from 2D images.

Advanced Projects

1. Full text-to-video pipeline: Build an end-to-end text-to-video generation system.
2. Text-to-3D with custom materials: Create a pipeline for generating 3D models with PBR materials.
3. Multi-modal generation system: Develop a system that can generate across multiple modalities (image, video, 3D).
4. Novel architecture research: Design and implement a novel generative architecture.

Summary Timeline

  • Months 1-3: Deep learning foundations, basic CNNs, VAEs, GANs
  • Months 4-8: Text-to-image mastery, diffusion models, Stable Diffusion
  • Months 9-12: Video generation, temporal modeling, advanced video
  • Months 13-16: 3D generation, NeRF, Gaussian Splatting, production pipeline
  • Months 17-20: Cutting-edge techniques, optimization, deployment

This roadmap provides a comprehensive path from foundational mathematics through state-of-the-art generative AI. Focus on hands-on implementation at each stage, and adjust the timeline based on your background and learning speed.

10. Additional Resources & References

10.1 Key GitHub Repositories

  • CompVis/stable-diffusion
  • CompVis/latent-diffusion
  • huggingface/diffusers
  • AUTOMATIC1111/stable-diffusion-webui
  • lllyasviel/ControlNet
  • threestudio-project/threestudio
  • nerfstudio-project/nerfstudio
  • graphdeco-inria/gaussian-splatting

10.2 Paper Collections

  • Diffusion Models: papers.labml.ai/papers/diffusion
  • NeRF Papers: awesome-NeRF (GitHub)
  • Video Generation: Papers with Code
  • 3D Generation: Recent arXiv submissions

10.3 Industry Benchmarks

  • GenEval: Comprehensive evaluation
  • T2I-CompBench: Compositional understanding
  • TIFA: Text-image faithfulness
  • DreamBench: Subject-driven generation