Lab 02: Single GPU Training Loop
Annotated code reading lab. Running code is optional.
Foundation / Training
Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Read code to understand the concept
Core mechanism
- The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
- Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
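The static part of the memory ledger can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it assumes fp32 weights and gradients and an Adam-style optimizer that keeps two fp32 moment tensors per parameter. It ignores activations, temporary buffers, communication buckets, and allocator overhead. The function name `static_memory_ledger` is hypothetical and does not come from the starter file.

```python
def static_memory_ledger(num_params: int, bytes_per_value: int = 4) -> dict:
    """Rough byte counts for the state that persists across training steps.

    Assumes every value is stored at the same width (4 bytes = fp32) and that
    the optimizer keeps two moment estimates per parameter, as Adam and AdamW do.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_states = 2 * num_params * bytes_per_value
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total": weights + gradients + optimizer_states,
    }


# Example: a hypothetical model with one million parameters.
ledger = static_memory_ledger(1_000_000)
for name, size in ledger.items():
    print(f"{name:>16}: {size / 1e6:.1f} MB")
```

Under these assumptions the persistent state is about four times the size of the weights alone, which is why optimizer state is usually the first thing named when a model "fits for inference but not for training".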
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-02-single-gpu-training/train_single_gpu.py. This preview explains the key idea; the linked starter file is the source of truth.
optimizer.zero_grad(set_to_none=True)
# Forward creates activations that autograd keeps for backward.
logits = model(x)
loss = F.cross_entropy(logits, y)
# Backward fills parameter.grad tensors.
loss.backward()
# AdamW reads gradients, updates parameters and maintains optimizer state.
optimizer.step()
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "steps": args.steps,
}, args.checkpoint)
Key code blocks
- optimizer.zero_grad: Clears stale gradients. Without it, gradients accumulate across steps, which is useful for gradient accumulation but surprising in a basic loop.
- model(x): Creates the forward computation and the activation values needed by backward.
- loss.backward: Runs autograd and populates parameter.grad; in DDP this is also where gradient hooks fire.
- optimizer.step: Updates weights and, for AdamW, creates/updates optimizer states.
- torch.save: Shows why checkpointing training usually needs both model and optimizer state.
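The accumulation behaviour behind zero_grad is easy to check on a toy model. The following is a minimal sketch, not part of the starter file: the `torch.nn.Linear` model, the random data, and the helper `grad_after_backward` are all assumptions chosen to keep the example small and CPU-only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)          # toy stand-in for the lab's model
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))


def grad_after_backward() -> torch.Tensor:
    """Run one forward/backward pass and return a copy of the weight gradient."""
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return model.weight.grad.clone()


model.zero_grad(set_to_none=True)
first = grad_after_backward()
# No zero_grad here: the second backward adds onto the existing .grad tensor.
second = grad_after_backward()
print(torch.allclose(second, 2 * first))

# With set_to_none=True the gradient tensors are released rather than zeroed.
model.zero_grad(set_to_none=True)
print(model.weight.grad is None)
```

Because the parameters and the batch are unchanged between the two backward calls, the accumulated gradient is twice the single-pass gradient. Gradient accumulation relies on this same behaviour, just with different micro-batches.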
How to read this code
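- Memory allocated and memory reserved are not the same in the PyTorch CUDA caching allocator: allocated counts bytes held by live tensors, while reserved also includes cached blocks the allocator keeps for reuse.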
- optimizer.step may allocate Adam state on its first call.
- The loop is conceptually the same even when the model becomes a Transformer.
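The first two points above can be observed directly. This sketch assumes a toy `torch.nn.Linear` model rather than the lab's model, and it only prints CUDA counters when a GPU is present, so it also runs on CPU. It inspects `optimizer.state` before and after the first `optimizer.step()` to show when AdamW's per-parameter state comes into existence.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)   # toy model: one weight, one bias
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)

# Before any step, the optimizer holds no per-parameter state.
states_before = len(optimizer.state)

optimizer.zero_grad(set_to_none=True)
F.cross_entropy(model(x), y).backward()
optimizer.step()

# After the first step there is one state entry per parameter tensor.
states_after = len(optimizer.state)
state_keys = sorted(optimizer.state[model.weight].keys())
print(states_before, states_after, state_keys)

if device == "cuda":
    # allocated: bytes in live tensors; reserved: bytes held by the caching allocator.
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```

On a GPU, reserved is normally greater than or equal to allocated, because the allocator keeps freed blocks cached instead of returning them to the driver immediately.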
What this code does not mean
- “zero_grad is optional.” It is optional only if you intentionally want gradient accumulation.
- “Checkpointing model weights is enough to resume training.” Optimizer state affects resumed training behavior.
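The second point can be made concrete by resuming the same run two ways and comparing the next step. This is a sketch under assumptions: a toy `torch.nn.Linear` model, a fixed random batch, an in-memory `io.BytesIO` buffer in place of `args.checkpoint`, and a learning rate chosen large enough that the difference is visible. The helpers `make` and `train_step` are hypothetical names.

```python
import io

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 8)
y = torch.randint(0, 3, (32,))


def make():
    """Build a fresh toy model and a fresh AdamW optimizer for it."""
    model = torch.nn.Linear(8, 3)
    return model, torch.optim.AdamW(model.parameters(), lr=0.1)


def train_step(model, optimizer):
    optimizer.zero_grad(set_to_none=True)
    F.cross_entropy(model(x), y).backward()
    optimizer.step()


# Train for a few steps, then checkpoint both model and optimizer state.
model, optimizer = make()
for _ in range(5):
    train_step(model, optimizer)

buffer = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)
buffer.seek(0)
ckpt = torch.load(buffer)

# Resume A: restore weights and optimizer state.
full_model, full_opt = make()
full_model.load_state_dict(ckpt["model"])
full_opt.load_state_dict(ckpt["optimizer"])

# Resume B: restore weights only; AdamW restarts with empty moments and step count.
weights_only_model, fresh_opt = make()
weights_only_model.load_state_dict(ckpt["model"])

for m, o in [(model, optimizer), (full_model, full_opt), (weights_only_model, fresh_opt)]:
    train_step(m, o)

full_matches = torch.allclose(model.weight, full_model.weight)
weights_only_matches = torch.allclose(model.weight, weights_only_model.weight)
print(full_matches, weights_only_matches)
```

The full resume follows the original run on the next step, while the weights-only resume diverges because AdamW's moment estimates and bias-correction step count start over.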
How to say it out loud
A training step runs forward to compute loss, backward to compute gradients, optimizer.step to update parameters, and zero_grad to clear gradients for the next step. CUDA memory changes because activations, gradients and optimizer states appear at different phases of the step.
Additional intuition
- Autograd docs help anchor the key lifecycle: forward records enough graph information for backward, and backward writes gradients to parameters. Official: PyTorch autograd tutorial
- Optimizer state_dict docs are a useful reminder that optimizer state is per-parameter state, not part of the model weights themselves. Official: PyTorch Optimizer state_dict
- Activation checkpointing explanations are useful here because they separate saved tensors for backward from persistent model or optimizer state. Blog: PyTorch activation checkpointing techniques
