InfraLens - AI Infra Annotated Code Reading Labs

Overview

AI Infra Annotated Code Reading Labs

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Goals

Learning goals

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
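The bandwidth-bound versus compute-bound question can be sketched as a roofline comparison. The snippet below is a minimal illustration, not part of any lab starter file: the function name `classify_kernel` and the device peaks (100 TFLOP/s, 1 TB/s) are hypothetical values chosen for round numbers, and the byte counts assume ideal data reuse.

```python
def classify_kernel(flops: float, bytes_moved: float,
                    peak_flops: float, peak_bandwidth: float) -> str:
    """Compare a kernel's arithmetic intensity with the device ridge point."""
    intensity = flops / bytes_moved        # FLOPs performed per byte moved
    ridge = peak_flops / peak_bandwidth    # intensity where the two rooflines meet
    return "compute-bound" if intensity >= ridge else "bandwidth-bound"


# Hypothetical device (assumption): 100 TFLOP/s and 1 TB/s, so ridge = 100 FLOP/byte.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Elementwise fp32 add over n elements: n FLOPs, 12n bytes (two reads, one write).
n = 1 << 20
elementwise = classify_kernel(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# Square fp16 matmul with N = 4096: 2*N^3 FLOPs, three N x N matrices of 2 bytes each.
N = 4096
matmul = classify_kernel(2 * N**3, 3 * N * N * 2, PEAK_FLOPS, PEAK_BW)

print(elementwise, matmul)
```

Under these assumptions the elementwise add sits far below the ridge point and the matmul far above it. Profiler counters are still needed to confirm what a real kernel achieves.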

Concept: a concept explanation, such as decoder-only block data flow.

Code: the starter code is reading material; the focus is key APIs, variables, and data flow.

Explain: each page provides common misunderstandings, a verbal explanation pattern, and further reading.

Read this after / related practice

Read after the AI Infra handbook. Use AI Infra interview practice when you want to rehearse the verbal explanation.

How to use these labs

From handbook to annotated code

First read the handbook section for the concept.
Then open the corresponding lab page.
Read the mental model before reading the code.
Use the line-by-line explanation to map code to system mechanism.
Use Further Reading only after the local explanation makes sense.
Running starter code is optional; the first pass is reading for mechanism.

External source policy

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Handbook to labs

Mapping concept pages to code-reading pages

Handbook Topic	Related Lab	What the lab makes concrete	External intuition focus
Transformer memory accounting	Lab 01	formulas for parameters, optimizer states, activations and KV Cache	memory ledger terms and what estimates exclude
Single GPU training loop	Lab 02	forward/backward/optimizer/checkpoint flow	activation lifetime and optimizer state lifecycle
DDP and all-reduce	Lab 03	process group, rank, DDP wrapper and gradient sync	gradient all-reduce and bucket overlap
GPU reduction	Lab 04	block/thread/shared memory cooperation	global contention versus staged aggregation
Shared memory banks	Lab 05	tile padding and bank conflict	bank mapping and coalesced global access
Triton kernel model	Lab 06	program/block/vectorized operations	program instances and fused HBM writes
FlashAttention	Lab 07	online softmax and avoiding `S x S` materialization	online softmax and IO-aware attention
ZeRO/FSDP	Lab 08	sharded training state formulas	state sharding and communication peaks
Profiling methodology	Lab 09	nsys/ncu command anatomy	timeline first, kernel counters second
Inference serving	Lab 10	serving config and metrics vocabulary	KV Cache paging and continuous batching
Quantization	Lab 11	what is saved and what can break	weight-only, activation and KV Cache tradeoffs
Topology-aware parallelism	Lab 12	TP/PP/DP/FSDP placement	parallel axes mapped to communication topology
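As a taste of the "sharded training state formulas" row above, here is a minimal sketch of per-GPU training state under the ZeRO stages. It is not the Lab 08 starter. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name `training_state_bytes_per_gpu` is hypothetical, and activations, temporary buffers, and communication buckets are excluded.

```python
def training_state_bytes_per_gpu(num_params: int, world_size: int,
                                 zero_stage: int = 0) -> float:
    """Per-GPU bytes for weights, gradients, and Adam state (assumed 2 + 2 + 12 bytes/param)."""
    weights = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params     # fp16 gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + Adam variance
    if zero_stage >= 1:          # stage 1 shards optimizer state
        optim /= world_size
    if zero_stage >= 2:          # stage 2 also shards gradients
        grads /= world_size
    if zero_stage >= 3:          # stage 3 also shards weights
        weights /= world_size
    return weights + grads + optim


GB = 1e9
params = 7_500_000_000           # illustrative 7.5B-parameter model
ledger = {stage: training_state_bytes_per_gpu(params, 64, stage) / GB
          for stage in range(4)}
for stage, gigabytes in ledger.items():
    print(f"ZeRO stage {stage}: {gigabytes:.2f} GB per GPU")
```

The estimate covers training state only; the communication cost of each stage is a separate question that the ZeRO/FSDP lab addresses.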

Recommended order

12 code-reading labs

Lab	Topic	Concept focus	What you read	What you should be able to explain	Open	Starter
01	Transformer Memory Accounting	Parameter count and memory ledger	Pure Python formula script	The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.	Open lab	README
02	Single GPU Training Loop	forward / backward / optimizer	Minimal PyTorch training loop	The training loop runs forward, backward, optimizer step, and checkpoint in order. The useful explanation names where activations, gradients, and optimizer state appear and how long each one lives.	Open lab	README
03	DDP Conversion	Multi-process training semantics	DDP initialization and loop	Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.	Open lab	README
04	CUDA Reduce Optimization	parallel reduction	Global atomic and shared-memory reduce kernels	Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.	Open lab	README
05	Shared Memory Bank Conflict	shared memory layout	Transpose kernels with 32x32 and 32x33 tiles	Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.	Open lab	README
06	Triton Fused Softmax	program/block mental model	PyTorch baseline and Triton kernel excerpt	Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.	Open lab	README
07	FlashAttention Mental Model	IO-aware attention	Naive attention and online softmax walkthrough	FlashAttention is IO-aware attention. The useful explanation shows how online softmax preserves the math while avoiding `S x S` materialization, and why that reduces IO.	Open lab	README
08	ZeRO / FSDP Memory Sharding	training state sharding	Memory accounting CLI formulas	What ZeRO-1/2/3 each shard, and why FSDP needs all-gather and reduce-scatter.	Open lab	README
09	Nsight Profiling Workflow	profiling methodology	Command file and checklist excerpt	What Nsight Systems, Nsight Compute, and PyTorch Profiler each look at.	Open lab	README
10	vLLM Serving Workload Config	serving workload shape	Serving config YAML	Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.	Open lab	README
11	Quantization Comparison	precision tradeoff	Comparison plan matrix	Quantization trades precision for memory and bandwidth. The useful explanation names what is saved, what can break, and how weight-only, activation, and KV Cache quantization differ in quality risk.	Open lab	README
12	64-GPU Parallelism Design	topology-aware parallelism	Design worksheet	Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.	Open lab	README
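The online softmax idea from Lab 07 can be previewed in a few lines of pure Python. This is a minimal sketch for one query row with scalar values, not the contents of the starter file. The function names and the `block_size` parameter are hypothetical. It streams over score blocks and keeps only a running maximum, a running normalizer, and a running weighted sum, so the full row of softmax weights is never stored.

```python
import math


def reference_attention_row(scores, values):
    """Naive version: materialize every softmax weight, then take the weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def online_attention_row(scores, values, block_size=2):
    """Streaming version: one pass over blocks with rescaled running statistics."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # unnormalized weighted sum, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale old statistics to the new maximum (the factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, v_blk))
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
naive = reference_attention_row(scores, values)
online = online_attention_row(scores, values, block_size=2)
print(naive, online)
```

The two results agree up to floating-point rounding, which is the sense in which the blockwise form preserves the math while changing the memory access pattern.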

Starter files

Annotated reading material

Each starter under code/ is a compact reading artifact. Use it to connect the handbook concept to concrete variables, shapes, APIs, and interview-ready explanations.

Lab	Starter type	Main file	Reading angle
01	Formula script	memory_accounting.py	How model config becomes a memory ledger
02	PyTorch script	train_single_gpu.py	Where activations, gradients and optimizer state appear
03	DDP script	train_ddp.py	How process groups and autograd hooks create gradient sync
04	CUDA C++	reduce.cu	How reduction changes from global contention to staged aggregation
05	CUDA C++	transpose.cu	How tile layout changes shared-memory bank behavior
06	Triton script	triton_softmax.py	How one program instance maps to a row/block of work
07	Educational PyTorch	flashattention_mental_model.py	How online softmax preserves math while reducing IO
08	Formula script	zero_memory_accounting.py	How sharding state trades communication for memory
09	Command notes	profile_commands.sh	How profiling tools answer different levels of “why slow?”
10	Serving config	benchmark_config.yaml	How workload shape maps to latency, throughput and KV capacity
11	Comparison plan	quantization_comparison_plan.md	How precision choice affects memory, bandwidth and quality risk
12	Worksheet	topology_design_worksheet.md	How parallelism axes map to topology and collectives
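The Lab 05 reading angle, how tile layout changes shared-memory bank behavior, can be previewed without a GPU. The sketch below is a pure-Python model, not the contents of transpose.cu. It assumes 32 banks of 4-byte words and 4-byte elements, and the helper name `max_conflict_degree` is hypothetical. It counts how many threads of one warp land in the same bank when the warp reads a tile column.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def max_conflict_degree(row_stride: int, col: int = 0, warp_size: int = 32) -> int:
    """Worst-case number of threads hitting one bank when thread t reads tile[t][col]."""
    # Element (row, col) lives at word index row * row_stride + col.
    banks = [(t * row_stride + col) % NUM_BANKS for t in range(warp_size)]
    return max(Counter(banks).values())


unpadded = max_conflict_degree(row_stride=32)  # 32x32 tile
padded = max_conflict_degree(row_stride=33)    # 32x33 tile, one padding column
print(unpadded, padded)
```

With a row stride of 32 every thread in the warp maps to the same bank, while the extra padding column shifts each row by one bank so the accesses spread across all 32 banks.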

Optional notes

If you later want to run or record experiments

The reports/ templates are optional. Use them only when you want to record profiling notes, serving benchmarks, or a parallelism design worksheet.

General optional notes
Profiling optional notes
Serving optional notes
Parallelism optional notes
