Accelerated Computing Roadmap
Back to learning accelerated computing, picking up from where I stopped in school.
PHASE 01 — Bridge to GPU
timeline: 0–3 months · focus: Foundation
Core Concepts — GPU Architecture
- CUDA Mode — Intro to CUDA series (threads, blocks, grids, SIMT execution model)
- NVIDIA CUDA C++ Programming Guide — Chapters 1–3 (architecture, programming model, execution model)
- Stanford CS149 — Parallel Computing, Lectures 1–4
First Kernels — Hands On
- Write your first kernel — Vector Addition (Hello World of CUDA)
- GPU Memory Hierarchy deep dive — global, shared, L1/L2, registers, memory coalescing
- Implement a naive matrix multiply (GEMM) — your baseline for all future optimization
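Before writing the CUDA version, it helps to internalize the loop nest a naive GEMM kernel implements. A minimal NumPy sketch (an illustration, not CUDA code): the two outer loops are exactly what a first kernel parallelizes, one thread per output element, while each thread runs the inner reduction.

```python
import numpy as np

def naive_gemm(A, B):
    """Naive O(M*N*K) matrix multiply -- the same loop nest a first CUDA
    GEMM kernel implements, with the (i, j) loops mapped to threads."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):          # in CUDA: one thread per (i, j) pair ...
        for j in range(N):
            acc = 0.0
            for k in range(K):  # ... and each thread runs this reduction
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
assert np.allclose(naive_gemm(A, B), A @ B)
```

Keep this triple loop in mind as the baseline: every optimization in Phase 02 is a restructuring of it.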
MPI → GPU Bridge
- Read NCCL documentation overview — all-reduce, broadcast, reduce-scatter (you know these cold)
- Book — Programming Massively Parallel Processors, Ch. 1–4 (Kirk & Hwu)
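The semantics carry over from MPI directly. As a reminder of what all-reduce means (a toy simulation, not how NCCL implements it — NCCL uses ring and tree algorithms over NVLink/PCIe):

```python
def all_reduce_sum(rank_buffers):
    """Simulate AllReduce(sum): after the call, every rank holds the
    element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]

buffers = [[1, 2], [3, 4], [5, 6]]   # three "ranks", two elements each
result = all_reduce_sum(buffers)
assert result == [[9, 12], [9, 12], [9, 12]]
```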
PHASE 02 — Go Deep on Optimization
timeline: 3–9 months · focus: Core Skill Building
Kernel Optimization — The Core Craft
- Read Simon Boehm's CUDA Matmul blog post — implement every optimization step yourself
- Implement tiled shared memory GEMM — measure the speedup vs your naive version
- Thread coarsening, register tiling, vectorized loads — reproduce all steps from Boehm's post
Profiling & Performance Analysis
- NVIDIA Nsight Systems — first profile session, learn to read the timeline view
- NVIDIA Nsight Compute — kernel-level analysis, occupancy, warp stalls, memory throughput
- Understand the Roofline Model — is your kernel memory-bound or compute-bound?
- GPU benchmarking methodology — CUDA events for timing, warmup runs, variance, wall-clock vs kernel time. Never trust a single measurement.
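The benchmarking discipline above applies on CPU too. A minimal harness sketch (plain `time.perf_counter` here; on GPU you would use CUDA events and synchronize before reading the clock):

```python
import time
import statistics

def bench(fn, warmup=3, reps=10):
    """Time fn with warmup runs and multiple measurements, and report
    the median -- never trust a single measurement."""
    for _ in range(warmup):           # warm caches / JIT / clocks
        fn()
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

t = bench(lambda: sum(range(100_000)))
assert t > 0
```

The median is more robust than the mean here because a single OS scheduling hiccup can badly skew an average of ten runs.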
cuDNN & cuBLAS — Know When Not to Write a Kernel
- cuBLAS — NVIDIA's optimized BLAS library. Understand when it outperforms hand-written kernels and how to call it from CUDA code
- cuDNN — deep learning primitives (conv, attention, normalization). Learn the API and when to defer to it instead of Triton or custom CUDA
Triton — GPU Programming in Python
- Triton official tutorials — vector add → softmax → matmul (in order)
- Implement FlashAttention in Triton
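The row-wise computation the Triton softmax tutorial fuses into one kernel is worth knowing cold. A plain-Python reference (the numerics, not the Triton code): subtract the row max before exponentiating so large inputs don't overflow.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row: shifting by the max
    changes nothing mathematically but keeps exp() in range."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

out = softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-12
assert out[2] > out[1] > out[0]
```

The same max-shift trick, applied incrementally per tile, is the heart of FlashAttention's online softmax.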
CPU Optimization — Your Existing Edge
- SIMD / AVX-512 intrinsics in C++ — auto-vectorization and manual SIMD
- Book — Computer Systems: A Programmer's Perspective, Ch. 5–6 (Bryant & O'Hallaron)
Structured Courses
- Udemy — CUDA Programming Masterclass
- NVIDIA Deep Learning Institute — Fundamentals of Accelerated Computing with CUDA C/C++ (free, comes with certificate)
PHASE 03 — Distributed & Multi-Accelerator
timeline: 9–18 months · focus: Scaling Out
Multi-GPU — Leveraging Your MPI Knowledge
- Multi-GPU programming with CUDA + NCCL — peer-to-peer, NVLink, collectives
- GPU-aware MPI — MPI calls operating directly on GPU memory, Open MPI + CUDA
- PyTorch Distributed — DDP and FSDP internals
- CUDA Streams and async execution — overlapping compute and communication
Study Real Systems
- Read and run llm.c by Andrej Karpathy — GPT-2 in pure C and CUDA
- Study DeepSpeed / Megatron-LM internals — tensor parallelism, pipeline parallelism, ZeRO
Hardware Generation Awareness
- Ampere (A100) — 2:4 structured sparsity, TF32, async memory copies, MIG. Read the architecture whitepaper.
- Hopper (H100) — warp specialization, Thread Block Clusters, Tensor Memory Accelerator (TMA), FP8. Understand what's new and why it matters for transformer workloads.
Broaden — AMD & Alternative Hardware
- AMD ROCm fundamentals — HIP (AMD's CUDA-like API), platform-agnostic GPU programming
- Book — Programming Massively Parallel Processors, Ch. 15–20 (Kirk & Hwu) — advanced patterns
PHASE 04 — Specialize
timeline: 18–30 months · focus: Advanced Topics
AI Inference Specialization
- Quantization — INT8, FP8, AWQ, GPTQ — making models smaller and faster
- Study vLLM internals — PagedAttention, read the paper and source code
- NVIDIA TensorRT — graph optimization, layer fusion, precision calibration
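The simplest case, symmetric per-tensor INT8 quantization, fits in a few lines (a toy sketch; AWQ and GPTQ add per-group scales, activation-aware weighting, and calibration on top of this basic idea):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats into
    [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 2.0]
q, s = quantize_int8(w)
deq = dequantize(q, s)
# round-trip error is bounded by one quantization step
assert all(abs(a - b) <= s for a, b in zip(w, deq))
```

Everything else in quantization research is about shrinking that round-trip error where it hurts model quality most.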
Compiler & Runtime Layer
- Apache TVM — ML compiler fundamentals, how ops get compiled to hardware
- MLIR — multi-level intermediate representation, used inside XLA and production AI compilers