AI Infra Annotated Code Reading Labs
Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Learning goals
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
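One way to make the bandwidth-bound versus compute-bound question concrete is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the device's balance point (peak FLOP/s divided by peak bytes/s). The sketch below is illustrative only. The function name `classify_kernel` and the peak numbers are hypothetical placeholders, not measurements of any real GPU, and the byte counts assume ideal traffic with no cache reuse or redundant loads.

```python
def classify_kernel(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Roofline-style check: compare arithmetic intensity with machine balance."""
    intensity = flops / bytes_moved                  # FLOPs per byte of memory traffic
    balance = peak_flops_per_s / peak_bytes_per_s    # FLOPs per byte the device can feed
    return "compute-bound" if intensity > balance else "bandwidth-bound"


# Hypothetical device (assumed numbers): 100 TFLOP/s and 1 TB/s, so balance = 100 FLOP/byte.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# fp32 vector add over n elements: n FLOPs, 12n bytes (two 4-byte reads, one 4-byte write).
n = 1 << 20
vector_add = classify_kernel(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# fp16 square matmul with N = 4096: 2*N^3 FLOPs, ideal traffic of three N*N matrices at 2 bytes.
N = 4096
matmul = classify_kernel(2 * N**3, 3 * N * N * 2, PEAK_FLOPS, PEAK_BW)

print(vector_add)  # bandwidth-bound
print(matmul)      # compute-bound
```

Profiler counters replace the ideal byte counts with measured traffic, which is where poor coalescing or missing fusion shows up as lower effective intensity.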
Read after the AI Infra handbook. Use AI Infra interview practice when you want to rehearse the verbal explanation.
From handbook to annotated code
- First read the handbook section for the concept.
- Then open the corresponding lab page.
- Read the mental model before reading the code.
- Use the line-by-line explanation to map code to system mechanism.
- Use Further Reading only after the local explanation makes sense.
- Running starter code is optional; the first pass is reading for mechanism.
Mapping concept pages to code-reading pages
| Handbook Topic | Related Lab | What the lab makes concrete | External intuition focus |
|---|---|---|---|
| Transformer memory accounting | Lab 01 | formulas for parameters, optimizer states, activations and KV Cache | memory ledger terms and what estimates exclude |
| Single GPU training loop | Lab 02 | forward/backward/optimizer/checkpoint flow | activation lifetime and optimizer state lifecycle |
| DDP and all-reduce | Lab 03 | process group, rank, DDP wrapper and gradient sync | gradient all-reduce and bucket overlap |
| GPU reduction | Lab 04 | block/thread/shared memory cooperation | global contention versus staged aggregation |
| Shared memory banks | Lab 05 | tile padding and bank conflict | bank mapping and coalesced global access |
| Triton kernel model | Lab 06 | program/block/vectorized operations | program instances and fused HBM writes |
| FlashAttention | Lab 07 | online softmax and avoiding S x S materialization | online softmax and IO-aware attention |
| ZeRO/FSDP | Lab 08 | sharded training state formulas | state sharding and communication peaks |
| Profiling methodology | Lab 09 | nsys/ncu command anatomy | timeline first, kernel counters second |
| Inference serving | Lab 10 | serving config and metrics vocabulary | KV Cache paging and continuous batching |
| Quantization | Lab 11 | what is saved and what can break | weight-only, activation and KV Cache tradeoffs |
| Topology-aware parallelism | Lab 12 | TP/PP/DP/FSDP placement | parallel axes mapped to communication topology |
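As a preview of the online softmax idea listed for Lab 07, the sketch below streams over blocks of attention scores for a single query and keeps only a running max, a running denominator, and a running weighted sum, so the full score row is never normalized in one piece. It is a pure-Python illustration, not the lab's starter file: the function names are hypothetical and the values are scalars to keep the arithmetic readable.

```python
import math


def softmax_two_pass(scores):
    """Reference softmax that needs the whole score row at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    m = -math.inf   # running max of the scores seen so far
    l = 0.0         # running softmax denominator, relative to m
    acc = 0.0       # running weighted sum of values, relative to m
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)   # rescales old statistics; 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = sum(w * v for w, v in zip(softmax_two_pass(scores), values))
streamed = online_softmax_weighted_sum(scores, values, block_size=2)
print(math.isclose(reference, streamed))  # True
```

The rescaling step is what lets each block be processed from fast memory and discarded, which is the mechanism behind avoiding S x S materialization.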
Recommended reading order
- Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
- A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.
- Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
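The KV Cache capacity point in the last item can be turned into arithmetic. The sketch below uses the usual per-token accounting (one K and one V tensor per layer) and assumes a hypothetical model config and memory budget. None of these numbers come from the handbook or the labs, and the estimate ignores paging overhead and fragmentation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, token, and sequence."""
    # The leading 2 counts one K tensor plus one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def max_cached_tokens(free_bytes, bytes_per_token):
    """How many tokens fit in the memory left after weights and activations."""
    return free_bytes // bytes_per_token


# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, fp16 cache (2 bytes per element).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Assumed budget: 16 GiB left for the cache after loading weights.
budget = 16 * 1024**3
print(max_cached_tokens(budget, per_token))  # 131072 tokens across all running sequences
```

The token budget is shared by every sequence in the batch, which is why batching policy and cache capacity have to be reasoned about together.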
12 code-reading labs
| Lab | Topic | Concept focus | What you read | What you should be able to explain | Open | Starter |
|---|---|---|---|---|---|---|
| 01 | Transformer Memory Accounting | Parameter count and memory ledger | Pure Python formula script | The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely. | Open lab | README |
| 02 | Single GPU Training Loop | forward / backward / optimizer | Minimal PyTorch training loop | Where activations, gradients, and optimizer state appear across the forward, backward, optimizer, and checkpoint steps, and how long each one lives. | Open lab | README |
| 03 | DDP Conversion | Multi-process training semantics | DDP initialization and loop | Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent. | Open lab | README |
| 04 | CUDA Reduce Optimization | parallel reduction | Global atomic and shared-memory reduce kernels | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. | Open lab | README |
| 05 | Shared Memory Bank Conflict | shared memory layout | Transpose kernels with 32x32 and 32x33 tiles | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. | Open lab | README |
| 06 | Triton Fused Softmax | program/block mental model | PyTorch baseline and Triton kernel excerpt | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. | Open lab | README |
| 07 | FlashAttention Mental Model | IO-aware attention | Naive attention and online softmax walkthrough | How online softmax preserves the attention math while avoiding S x S materialization and reducing IO. | Open lab | README |
| 08 | ZeRO / FSDP Memory Sharding | training state sharding | Memory accounting CLI formulas | What ZeRO-1/2/3 each shard, and why FSDP needs all-gather and reduce-scatter. | Open lab | README |
| 09 | Nsight Profiling Workflow | profiling methodology | Command file and checklist excerpt | What Nsight Systems, Nsight Compute, and PyTorch Profiler each look at. | Open lab | README |
| 10 | vLLM Serving Workload Config | serving workload shape | Serving config YAML | Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics. | Open lab | README |
| 11 | Quantization Comparison | precision tradeoff | Comparison plan matrix | What each precision choice saves in memory and bandwidth, what can break in quality, and how weight-only, activation, and KV Cache quantization trade off. | Open lab | README |
| 12 | 64-GPU Parallelism Design | topology-aware parallelism | Design worksheet | Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent. | Open lab | README |
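The 32x32 versus 32x33 tiles named for Lab 05 can be checked with a few lines of index arithmetic before reading the CUDA code. The sketch below models shared memory as 32 banks of 4-byte words, a common configuration that is assumed here rather than taken from the lab, and counts how many threads of a warp land on the same bank when each thread reads one element of a tile column. The helper name `max_conflict_degree` is hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def max_conflict_degree(row_stride, tile=32):
    """Worst-case number of threads hitting one bank during a column read.

    Thread t of the warp reads element [t][col] of a tile whose rows are
    `row_stride` words apart in shared memory.
    """
    worst = 0
    for col in range(tile):
        hits = {}
        for t in range(tile):
            word = t * row_stride + col   # linear word index of tile[t][col]
            bank = word % NUM_BANKS       # consecutive words map to consecutive banks
            hits[bank] = hits.get(bank, 0) + 1
        worst = max(worst, max(hits.values()))
    return worst


print(max_conflict_degree(32))  # 32: every thread in the warp hits the same bank
print(max_conflict_degree(33))  # 1: one padding word per row spreads the column across banks
```

With a stride of 33 the bank index becomes (t + col) mod 32, so the 32 threads of a warp touch 32 different banks.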
Annotated reading material
Each starter under code/ is a compact reading artifact. Use it to connect the handbook concept to concrete variables, shapes, APIs, and interview-ready explanations.
| Lab | Starter type | Main file | Reading angle |
|---|---|---|---|
| 01 | Formula script | memory_accounting.py | How model config becomes a memory ledger |
| 02 | PyTorch script | train_single_gpu.py | Where activations, gradients and optimizer state appear |
| 03 | DDP script | train_ddp.py | How process groups and autograd hooks create gradient sync |
| 04 | CUDA C++ | reduce.cu | How reduction changes from global contention to staged aggregation |
| 05 | CUDA C++ | transpose.cu | How tile layout changes shared-memory bank behavior |
| 06 | Triton script | triton_softmax.py | How one program instance maps to a row/block of work |
| 07 | Educational PyTorch | flashattention_mental_model.py | How online softmax preserves math while reducing IO |
| 08 | Formula script | zero_memory_accounting.py | How sharding state trades communication for memory |
| 09 | Command notes | profile_commands.sh | How profiling tools answer different levels of “why slow?” |
| 10 | Serving config | benchmark_config.yaml | How workload shape maps to latency, throughput and KV capacity |
| 11 | Comparison plan | quantization_comparison_plan.md | How precision choice affects memory, bandwidth and quality risk |
| 12 | Worksheet | topology_design_worksheet.md | How parallelism axes map to topology and collectives |
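For the sharding row (Lab 08), the per-parameter accounting can be sketched as follows. It assumes mixed-precision Adam with 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter. This is a common convention, not necessarily what `zero_memory_accounting.py` uses. The function name is hypothetical, and activations, temporary buffers, and communication buckets are left out.

```python
def zero_bytes_per_param(stage, world_size, param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Model-state bytes per parameter held by one rank under a given ZeRO stage."""
    p, g, o, n = param_bytes, grad_bytes, optim_bytes, world_size
    if stage == 0:                # plain data parallelism: everything replicated
        return p + g + o
    if stage == 1:                # optimizer states sharded
        return p + g + o / n
    if stage == 2:                # optimizer states and gradients sharded
        return p + (g + o) / n
    if stage == 3:                # parameters sharded as well
        return (p + g + o) / n
    raise ValueError(f"unknown ZeRO stage: {stage}")


# Hypothetical example: 7.5e9 parameters on 64 ranks.
num_params = 7.5e9
for stage in range(4):
    gb = zero_bytes_per_param(stage, world_size=64) * num_params / 1e9
    print(f"stage {stage}: {gb:.1f} GB of model state per rank")
```

Stage 3 is the point where parameters must be gathered before use and gradients reduce-scattered afterward, which is the memory-for-communication trade the reading angle refers to.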
If you later want to run or record experiments
The reports/ templates are optional. Use them only when you want to record profiling notes, serving benchmarks, or a parallelism design worksheet.
