Accelerated Computing Roadmap
Back to learning accelerated computing, picking up from where I stopped in school.
PHASE 01 — Bridge to GPU
timeline: 0–3 months · focus: Foundation
Core Concepts — GPU Architecture
- CUDA Mode — Intro to CUDA series (threads, blocks, grids, SIMT execution model)
- NVIDIA CUDA C++ Programming Guide — Chapters 1–3 (architecture, programming model, execution model)
- Stanford CS149 — Parallel Computing, Lectures 1–4
First Kernels — Hands On
- Write your first kernel — Vector Addition (Hello World of CUDA)
- GPU Memory Hierarchy deep dive — global, shared, L1/L2, registers, memory coalescing
- Implement a naive matrix multiply (GEMM) — your baseline for all future optimization
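Before writing the CUDA version, it helps to internalize the loop nest a naive GEMM kernel implements. A minimal NumPy sketch (an illustration, not CUDA code): the two outer loops are exactly what a first kernel parallelizes, one thread per output element, while each thread runs the inner reduction.

```python
import numpy as np

def naive_gemm(A, B):
    """Naive O(M*N*K) matrix multiply -- the same loop nest a first CUDA
    GEMM kernel implements, with the (i, j) loops mapped to threads."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):          # in CUDA: one thread per (i, j) pair ...
        for j in range(N):
            acc = 0.0
            for k in range(K):  # ... and each thread runs this reduction
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
assert np.allclose(naive_gemm(A, B), A @ B)
```

Keep this triple loop in mind as the baseline: every optimization in Phase 02 is a restructuring of it.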
MPI → GPU Bridge
- Read NCCL documentation overview — all-reduce, broadcast, reduce-scatter (you know these cold)
- Book — Programming Massively Parallel Processors, Ch. 1–4 (Kirk & Hwu)
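The semantics carry over from MPI directly. As a reminder of what all-reduce means (a toy simulation, not how NCCL implements it — NCCL uses ring and tree algorithms over NVLink/PCIe):

```python
def all_reduce_sum(rank_buffers):
    """Simulate AllReduce(sum): after the call, every rank holds the
    element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]

buffers = [[1, 2], [3, 4], [5, 6]]   # three "ranks", two elements each
result = all_reduce_sum(buffers)
assert result == [[9, 12], [9, 12], [9, 12]]
```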
PHASE 02 — Go Deep on Optimization
timeline: 3–9 months · focus: Core Skill Building
Kernel Optimization — The Core Craft
- Read Simon Boehm's CUDA Matmul blog post — implement every optimization step yourself
- Implement tiled shared memory GEMM — measure the speedup vs your naive version
- Thread coarsening, register tiling, vectorized loads — reproduce all steps from Boehm's post
Profiling & Performance Analysis
- NVIDIA Nsight Systems — first profile session, learn to read the timeline view
- NVIDIA Nsight Compute — kernel-level analysis, occupancy, warp stalls, memory throughput
- Understand the Roofline Model — is your kernel memory-bound or compute-bound?
- GPU benchmarking methodology — CUDA events for timing, warmup runs, variance, wall-clock vs kernel time. Never trust a single measurement.
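The benchmarking discipline above applies on CPU too. A minimal harness sketch (plain `time.perf_counter` here; on GPU you would use CUDA events and synchronize before reading the clock):

```python
import time
import statistics

def bench(fn, warmup=3, reps=10):
    """Time fn with warmup runs and multiple measurements, and report
    the median -- never trust a single measurement."""
    for _ in range(warmup):           # warm caches / JIT / clocks
        fn()
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

t = bench(lambda: sum(range(100_000)))
assert t > 0
```

The median is more robust than the mean here because a single OS scheduling hiccup can badly skew an average of ten runs.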
cuDNN & cuBLAS — Know When Not to Write a Kernel
- cuBLAS — NVIDIA's optimized BLAS library. Understand when it outperforms hand-written kernels and how to call it from CUDA code
- cuDNN — deep learning primitives (conv, attention, normalization). Learn the API and when to defer to it instead of Triton or custom CUDA
Triton — GPU Programming in Python
- Triton official tutorials — vector add → softmax → matmul (in order)
- Implement FlashAttention in Triton
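The row-wise computation the Triton softmax tutorial fuses into one kernel is worth knowing cold. A plain-Python reference (the numerics, not the Triton code): subtract the row max before exponentiating so large inputs don't overflow.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row: shifting by the max
    changes nothing mathematically but keeps exp() in range."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

out = softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-12
assert out[2] > out[1] > out[0]
```

The same max-shift trick, applied incrementally per tile, is the heart of FlashAttention's online softmax.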
CPU Optimization — Your Existing Edge
- SIMD / AVX-512 intrinsics in C++ — auto-vectorization and manual SIMD
- Book — Computer Systems: A Programmer's Perspective, Ch. 5–6 (Bryant & O'Hallaron)
Structured Courses
- Udemy — CUDA Programming Masterclass
- NVIDIA Deep Learning Institute — Fundamentals of Accelerated Computing with CUDA C/C++ (free, comes with certificate)
PHASE 03 — Distributed & Multi-Accelerator
timeline: 9–18 months · focus: Scaling Out
Multi-GPU — Leveraging Your MPI Knowledge
- Multi-GPU programming with CUDA + NCCL — peer-to-peer, NVLink, collectives
- GPU-aware MPI — MPI calls operating directly on GPU memory, Open MPI + CUDA
- PyTorch Distributed — DDP and FSDP internals
- CUDA Streams and async execution — overlapping compute and communication
Study Real Systems
- Read and run llm.c by Andrej Karpathy — GPT-2 in pure C and CUDA
- Study DeepSpeed / Megatron-LM internals — tensor parallelism, pipeline parallelism, ZeRO
Hardware Generation Awareness
- Ampere (A100) — 2:4 structured sparsity, TF32, async memory copies, MIG. Read the architecture whitepaper.
- Hopper (H100) — warp specialization, Thread Block Clusters, Tensor Memory Accelerator (TMA), FP8. Understand what's new and why it matters for transformer workloads.
Broaden — AMD & Alternative Hardware
- AMD ROCm fundamentals — HIP (AMD's CUDA-like API), platform-agnostic GPU programming
- Book — Programming Massively Parallel Processors, Ch. 15–20 (Kirk & Hwu) — advanced patterns
PHASE 04 — Specialize
timeline: 18–30 months · focus: Advanced Topics
AI Inference Specialization
- Quantization — INT8, FP8, AWQ, GPTQ — making models smaller and faster
- Study vLLM internals — PagedAttention, read the paper and source code
- NVIDIA TensorRT — graph optimization, layer fusion, precision calibration
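The simplest case, symmetric per-tensor INT8 quantization, fits in a few lines (a toy sketch; AWQ and GPTQ add per-group scales, activation-aware weighting, and calibration on top of this basic idea):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats into
    [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 2.0]
q, s = quantize_int8(w)
deq = dequantize(q, s)
# round-trip error is bounded by one quantization step
assert all(abs(a - b) <= s for a, b in zip(w, deq))
```

Everything else in quantization research is about shrinking that round-trip error where it hurts model quality most.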
Compiler & Runtime Layer
- Apache TVM — ML compiler fundamentals, how ops get compiled to hardware
- MLIR — multi-level intermediate representation, used inside XLA and production AI compilers