
accelerated computing roadmap

picking up from where I stopped in school.

PHASE 01 — Bridge to GPU

timeline: 0–3 months · focus: Foundation

Core Concepts — GPU Architecture

First Kernels — Hands On
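Before writing a real kernel, it helps to internalize CUDA's indexing model: each thread computes one global index from its block and thread coordinates, with a bounds guard for ragged sizes. A pure-Python sketch of that scheme (the function and variable names mirror CUDA's `blockIdx`/`blockDim`/`threadIdx`, but this is a serial model, not CUDA):

```python
# Pure-Python model of CUDA's thread-indexing scheme for a vector add.
# Each "thread" computes one output element; the bounds check mirrors
# the guard a real kernel needs when n is not a multiple of the block size.

def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out, n):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # guard against overrun
        out[i] = a[i] + b[i]

def launch(grid_dim, block_dim, a, b, n):
    out = [0] * n
    # A real GPU runs these "threads" in parallel; here we loop serially.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out, n)
    return out

n = 10
a = list(range(n))
b = [x * 2 for x in range(n)]
grid_dim = (n + 3) // 4  # ceil-divide, as in a real launch configuration
print(launch(grid_dim, 4, a, b, n))  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```

The ceil-divide for `grid_dim` and the `i < n` guard are the two details that trip up most first kernels.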

MPI → GPU Bridge

  • Read NCCL documentation overview — all-reduce, broadcast, scatter (you know these cold)
  • Book — Programming Massively Parallel Processors, Ch. 1–4 (Kirk & Hwu)
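Since the collectives carry over directly from MPI, a plain-Python sketch of their semantics is a quick sanity check while reading the NCCL docs. `ranks` is a hypothetical list of per-process buffers; the function names follow MPI/NCCL usage but this models only the math, not the communication:

```python
# Plain-Python sketch of the collective semantics NCCL implements on GPUs.
# "ranks" holds each process's local buffer.

def all_reduce(ranks):
    # Elementwise sum across ranks; every rank receives the full result.
    total = [sum(vals) for vals in zip(*ranks)]
    return [list(total) for _ in ranks]

def broadcast(ranks, root=0):
    # Every rank receives a copy of root's buffer.
    return [list(ranks[root]) for _ in ranks]

def scatter(ranks, root=0):
    # Root's buffer is split into equal chunks, one per rank.
    chunk = len(ranks[root]) // len(ranks)
    return [ranks[root][r * chunk:(r + 1) * chunk] for r in range(len(ranks))]

ranks = [[1, 2], [3, 4], [5, 6]]
print(all_reduce(ranks))  # [[9, 12], [9, 12], [9, 12]]
```

NCCL's contribution is not the math but doing it at link speed over NVLink/InfiniBand, typically with a ring or tree schedule.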

PHASE 02 — Go Deep on Optimization

timeline: 3–9 months · focus: Core Skill Building

Kernel Optimization — The Core Craft

Profiling & Performance Analysis
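The first question profiling answers is whether a kernel is compute-bound or memory-bound, which the roofline model captures in a few lines. A minimal sketch, with illustrative (roughly A100-class) hardware numbers, not a profiler replacement:

```python
# Minimal roofline model: a kernel is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) falls below peak_flops / peak_bandwidth.
# Hardware numbers are illustrative, roughly A100-class.

PEAK_FLOPS = 19.5e12   # FP32 FLOP/s (illustrative)
PEAK_BW = 1.5e12       # bytes/s of HBM bandwidth (illustrative)

def attainable_flops(intensity):
    # Roofline: the minimum of the compute roof and the bandwidth slope.
    return min(PEAK_FLOPS, PEAK_BW * intensity)

ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet (~13 FLOP/byte)

# Vector add: 1 FLOP per 12 bytes moved (two FP32 loads + one store).
vadd_intensity = 1 / 12
frac = attainable_flops(vadd_intensity) / PEAK_FLOPS
print(f"vector add can reach {frac:.1%} of peak")  # well below 100%: memory-bound
```

Nsight Compute reports the measured counterpart of these numbers per kernel; the model just tells you which roof you are under before you start tuning.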

cuDNN & cuBLAS — Know When Not to Write a Kernel

  • cuBLAS — NVIDIA's optimized BLAS library. Understand when it outperforms hand-written kernels and how to call it from CUDA code
  • cuDNN — deep learning primitives (conv, attention, normalization). Learn the API and when to defer to it instead of Triton or custom CUDA
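The "don't write a kernel" lesson is easy to demonstrate on the CPU, since NumPy's matmul dispatches to an optimized BLAS the same way CUDA code defers to cuBLAS. A quick timing sketch (absolute numbers will vary by machine; the gap is the point):

```python
import time
import numpy as np

def naive_matmul(a, b):
    # Textbook triple loop: no tiling, no vectorization, no cache blocking.
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            out[i, j] = s
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

t0 = time.perf_counter(); c_naive = naive_matmul(a, b); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); c_blas = a @ b; t_blas = time.perf_counter() - t0

assert np.allclose(c_naive, c_blas)  # same result, wildly different cost
print(f"naive: {t_naive:.4f}s  BLAS: {t_blas:.6f}s")
```

cuBLAS holds the same advantage over a naive CUDA GEMM, for the same reasons (tiling, tensor cores, years of tuning), which is why the baseline question for any custom kernel is whether it beats the library call.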

Triton — GPU Programming in Python

CPU Optimization — Your Existing Edge

Structured Courses

  • Udemy — CUDA Programming Masterclass
  • NVIDIA Deep Learning Institute — Fundamentals of Accelerated Computing with CUDA C/C++ (free, comes with certificate)

PHASE 03 — Distributed & Multi-Accelerator

timeline: 9–18 months · focus: Distributed & Multi-Accelerator

Multi-GPU — Leveraging Your MPI Knowledge

Study Real Systems

Hardware Generation Awareness

Broaden — AMD & Alternative Hardware

PHASE 04 — Specialize

timeline: 18–30 months · focus: Advanced Topics

AI Inference Specialization

Compiler & Runtime Layer