InfraLens

A clear starting point for learning AI infrastructure.

Overview

AI Infra Annotated Code Reading Labs

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Goals

Learning goals

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
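One way to make the bandwidth-bound versus compute-bound question concrete is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the device's ratio of peak compute to peak memory bandwidth. The sketch below is illustrative only. The function names and the device numbers are assumptions made for this example, not values from the labs or from any real GPU, and the byte counts assume each operand is moved exactly once.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def classify(flops: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> str:
    """Roofline-style classification against the device ridge point."""
    ridge_point = peak_flops / peak_bandwidth  # FLOPs per byte at the roofline knee
    intensity = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if intensity >= ridge_point else "bandwidth-bound"


# Hypothetical device (assumed numbers): 100 TFLOP/s and 2 TB/s -> ridge at 50 FLOP/byte.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 2e12

# fp32 vector add: 1 FLOP per element, 12 bytes per element (two loads, one store).
n = 1 << 20
vector_add = classify(n, 12 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# fp16 square matmul, idealized traffic: 2*m^3 FLOPs, three m x m matrices at 2 bytes each.
m = 4096
matmul = classify(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BANDWIDTH)

print(vector_add, matmul)
```

Under these assumptions the vector add sits far below the ridge point and the large matmul far above it. Real kernels move more bytes than the idealized count, which is why profiler counters are still needed to confirm the classification.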

  • Concept: a concept explanation, for example decoder-only block data flow.
  • Code: the starter code is reading material; the focus is on key APIs, variables, and data flow.
  • Explain: each page provides common misunderstandings, a verbal explanation pattern, and further reading.

Read this after / related practice

Read after the AI Infra handbook. Use AI Infra interview practice when you want to rehearse the verbal explanation.

How to use these labs

From handbook to annotated code

  • First read the handbook section for the concept.
  • Then open the corresponding lab page.
  • Read the mental model before reading the code.
  • Use the line-by-line explanation to map code to system mechanism.
  • Use Further Reading only after the local explanation makes sense.
  • Running starter code is optional; the first pass is reading for mechanism.

Recommended order

Recommended reading order

  1. Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
  2. A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.
  3. Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
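The KV Cache capacity mentioned in the inference item above can be estimated with simple arithmetic: every cached token stores one key and one value vector per layer and per KV head. The following is a minimal sketch under stated assumptions. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 16 GiB budget are hypothetical numbers chosen for illustration, and the function names are not taken from any lab starter or serving framework.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: key + value, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left over for the cache."""
    return budget_bytes // bytes_per_token


# Hypothetical model shape and memory budget (assumptions for illustration).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32,
                                     head_dim=128, bytes_per_elem=2)
budget = 16 * 1024**3  # 16 GiB reserved for the KV cache
capacity = max_cached_tokens(budget, per_token)

print(per_token, capacity)  # 524288 bytes per token, 32768 tokens
```

The token capacity is shared across all in-flight sequences, so it bounds batch size times context length together. That is one reason batching and tail latency cannot be reasoned about separately from cache capacity.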

Curriculum

12 code-reading labs

| Lab | Topic | Concept focus | What you read | What you should be able to explain |
|---|---|---|---|---|
| 01 | Transformer Memory Accounting | Parameter count and memory ledger | Pure Python formula script | Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics. |
| 02 | Single GPU Training Loop | Forward / backward / optimizer | Minimal PyTorch training loop | The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely. |
| 03 | DDP Conversion | Multi-process training semantics | DDP initialization and loop | Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent. |
| 04 | CUDA Reduce Optimization | Parallel reduction | Global atomic and shared-memory reduce kernels | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. |
| 05 | Shared Memory Bank Conflict | Shared memory layout | Transpose kernels with 32x32 and 32x33 tiles | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. |
| 06 | Triton Fused Softmax | Program/block mental model | PyTorch baseline and Triton kernel excerpt | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. |
| 07 | FlashAttention Mental Model | IO-aware attention | Naive attention and online softmax walkthrough | A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable. |
| 08 | ZeRO / FSDP Memory Sharding | Training state sharding | Memory accounting CLI formulas | What ZeRO-1/2/3 each shard, and why FSDP needs all-gather and reduce-scatter. |
| 09 | Nsight Profiling Workflow | Profiling methodology | Command file and checklist excerpt | What Nsight Systems, Nsight Compute, and PyTorch Profiler each look at. |
| 10 | vLLM Serving Workload Config | Serving workload shape | Serving config YAML | Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics. |
| 11 | Quantization Comparison | Precision tradeoff | Comparison plan matrix | Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound. |
| 12 | 64-GPU Parallelism Design | Topology-aware parallelism | Design worksheet | Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent. |

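The memory ledger and sharding questions in Labs 01, 02, and 08 can be previewed with a few lines of arithmetic. The sketch below assumes the commonly cited mixed-precision Adam layout: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 master weights, momentum, and variance). It ignores activations, temporary buffers, and communication buckets. The function name and the 7.5B-parameter, 64-GPU example are assumptions for illustration and are not taken from the starter files.

```python
def zero_bytes_per_gpu(num_params: int, num_gpus: int, stage: int) -> float:
    """Per-GPU bytes for weights + gradients + optimizer state under ZeRO stage 0-3.

    Assumes mixed-precision Adam: 2 B/param weights, 2 B/param gradients,
    12 B/param optimizer state. Activations and buffers are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:   # ZeRO-1 shards optimizer state
        optim /= num_gpus
    if stage >= 2:   # ZeRO-2 also shards gradients
        grads /= num_gpus
    if stage >= 3:   # ZeRO-3 also shards the weights themselves
        weights /= num_gpus
    return weights + grads + optim


# Hypothetical example: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Each stage shards one more piece of state and saves more memory, but stage 3 has to all-gather weights before use, which is the communication cost the ledger does not show.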
Starter files

Annotated reading material

Each starter under code/ is a compact reading artifact. Use it to connect the handbook concept to concrete variables, shapes, APIs, and interview-ready explanations.

| Lab | Starter type | Main file | Reading angle |
|---|---|---|---|
| 01 | Formula script | memory_accounting.py | How model config becomes a memory ledger |
| 02 | PyTorch script | train_single_gpu.py | Where activations, gradients and optimizer state appear |
| 03 | DDP script | train_ddp.py | How process groups and autograd hooks create gradient sync |
| 04 | CUDA C++ | reduce.cu | How reduction changes from global contention to staged aggregation |
| 05 | CUDA C++ | transpose.cu | How tile layout changes shared-memory bank behavior |
| 06 | Triton script | triton_softmax.py | How one program instance maps to a row/block of work |
| 07 | Educational PyTorch | flashattention_mental_model.py | How online softmax preserves math while reducing IO |
| 08 | Formula script | zero_memory_accounting.py | How sharding state trades communication for memory |
| 09 | Command notes | profile_commands.sh | How profiling tools answer different levels of “why slow?” |
| 10 | Serving config | benchmark_config.yaml | How workload shape maps to latency, throughput and KV capacity |
| 11 | Comparison plan | quantization_comparison_plan.md | How precision choice affects memory, bandwidth and quality risk |
| 12 | Worksheet | topology_design_worksheet.md | How parallelism axes map to topology and collectives |

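The Lab 07 reading angle, that online softmax preserves the math while reducing IO, can be checked in a few lines before opening the starter. The sketch below is a pure-Python illustration written for this page and is not the content of flashattention_mental_model.py; the function names and chunk size are assumptions. It streams over the input in chunks, keeping only a running maximum and a running sum that is rescaled whenever the maximum changes.

```python
import math


def naive_softmax(xs):
    """Reference softmax: needs the whole row available to find the max first."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(chunks):
    """One streaming pass: running max m and running sum d of exp(x - m)."""
    m = float("-inf")
    d = 0.0
    for chunk in chunks:
        new_m = max(m, max(chunk))
        # Rescale the old sum to the new max, then add this chunk's contribution.
        d = d * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in chunk)
        m = new_m
    return m, d


def online_softmax(xs, chunk_size=4):
    """Softmax computed from the streamed statistics."""
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    m, d = online_softmax_stats(chunks)
    return [math.exp(x - m) / d for x in xs]


row = [0.5, -1.0, 3.0, 2.0, 7.5, -4.0, 0.0, 1.25, 6.0]
reference = naive_softmax(row)
streamed = online_softmax(row, chunk_size=4)
max_abs_diff = max(abs(a - b) for a, b in zip(reference, streamed))
print(max_abs_diff)
```

The two results agree up to floating-point rounding. The rescaling step is what lets an attention kernel process keys and values tile by tile without first materializing the full score row in slow memory.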
Optional notes

If you later want to run or record experiments

The reports/ templates are optional. Use them only when you want to record profiling notes, serving benchmarks, or a parallelism design worksheet.