Lab 02 - Single GPU Training Loop

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Foundation / Training

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Concept Goal

Read code to understand the concept

Mental Model

Core mechanism

The memory ledger accounts separately for weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and the KV cache, so the scaling bottleneck can be named precisely.
Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
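
The persistent part of the ledger can be estimated with simple arithmetic. The sketch below is illustrative only: `persistent_bytes` is a hypothetical helper, not part of the starter file, and it assumes fp32 (4-byte) weights, gradients, and optimizer states, with AdamW keeping two state tensors per parameter. Activations, temporary buffers, and communication buckets are left out because they depend on batch size and architecture.

```python
def persistent_bytes(num_params, bytes_per_element=4, adam_states_per_param=2):
    """Estimate the step-independent part of the ledger (hypothetical helper).

    Assumes weights, gradients, and optimizer states share one dtype width
    (4 bytes = fp32) and that AdamW keeps two tensors per parameter.
    """
    ledger = {
        "weights": num_params * bytes_per_element,
        "gradients": num_params * bytes_per_element,
        "optimizer_states": num_params * adam_states_per_param * bytes_per_element,
    }
    # The right-hand side is evaluated before "total" is added to the dict.
    ledger["total"] = sum(ledger.values())
    return ledger


ledger = persistent_bytes(num_params=1_000_000)
for name, size in ledger.items():
    print(f"{name:>16}: {size / 1e6:.1f} MB")
```

Under these assumptions the persistent state costs 16 bytes per parameter, and the optimizer states alone are twice the size of the weights.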

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README
train_single_gpu.py
Optional note template

Annotated Code Preview

Starter Preview

Excerpt from code/lab-02-single-gpu-training/train_single_gpu.py. This preview explains the key idea; the linked starter file is the source of truth.

optimizer.zero_grad(set_to_none=True)

# Forward creates activations that autograd keeps for backward.
logits = model(x)
loss = F.cross_entropy(logits, y)

# Backward fills parameter.grad tensors.
loss.backward()

# AdamW reads gradients, updates parameters and maintains optimizer state.
optimizer.step()

torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "steps": args.steps,
}, args.checkpoint)

Line-by-line Explanation

Key code blocks

optimizer.zero_grad: Clears stale gradients. Without it, gradients accumulate across steps, which is useful for gradient accumulation but surprising in a basic loop.
model(x): Runs the forward pass and records the activation values that backward needs.
loss.backward: Runs autograd and populates parameter.grad; in DDP this is also where gradient hooks fire.
optimizer.step: Updates weights and, for AdamW, creates/updates optimizer states.
torch.save: Writes both model and optimizer state, because resuming training usually needs both.
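
The accumulation behavior described for optimizer.zero_grad can be observed directly. This is a minimal CPU sketch, assuming a tiny hypothetical `torch.nn.Linear` model and random data in place of the starter file's model and batch; `backward_once` is an illustrative helper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)                     # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))


def backward_once():
    """Forward and backward on the same batch, without touching the optimizer."""
    loss = F.cross_entropy(model(x), y)
    loss.backward()


backward_once()
first = model.weight.grad.clone()

# No zero_grad in between: autograd adds the new gradient into .grad.
backward_once()
accumulated = model.weight.grad.clone()

# set_to_none=True drops the .grad tensors instead of filling them with zeros.
optimizer.zero_grad(set_to_none=True)
cleared = model.weight.grad
```

Because the batch and weights are unchanged between the two backward calls, `accumulated` is twice `first`, and `cleared` is `None` after the reset.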

What to Notice

How to read this code

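Memory allocated and memory reserved are not the same in the PyTorch CUDA caching allocator: allocated counts bytes held by live tensors, while reserved also includes cached blocks the allocator keeps for reuse.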
optimizer.step may allocate Adam state on its first call.
The loop is conceptually the same even when the model becomes a Transformer.
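
The first two observations above can be checked with a short probe. This is a sketch under assumptions: the model is a hypothetical small `torch.nn.Linear`, `report` is an illustrative helper, and the CUDA counters are only printed when a GPU is present, so the script also runs on CPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)


def report(tag):
    """Print allocator counters; skipped on CPU, where they do not apply."""
    if device == "cuda":
        allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
        reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
        print(f"{tag}: allocated={allocated} reserved={reserved}")


# AdamW has no per-parameter state until its first step.
state_entries_before = len(optimizer.state)
report("before first step")

loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()
optimizer.step()
report("after first step")

# One state entry per parameter tensor (weight and bias).
state_entries_after = len(optimizer.state)
state_keys = sorted(optimizer.state[model.weight].keys())
print(state_entries_before, state_entries_after, state_keys)
```

On a GPU, comparing the two reports shows allocated memory growing once the optimizer state exists; reserved is never smaller than allocated.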

Common Misunderstandings

What this code does not mean

“zero_grad is optional.” It is optional only if you intentionally want gradient accumulation.
“Checkpointing model weights is enough to resume training.” For AdamW, the optimizer state holds moment estimates and a step count that shape the next update, so resuming without it changes training behavior.
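
The second point can be illustrated by resuming the same checkpoint two ways. This is a minimal CPU sketch, not the starter file's code: the tiny model, the in-memory `io.BytesIO` buffer standing in for a checkpoint path, and the helpers `make` and `step` are all assumptions made for illustration.

```python
import io

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))


def make():
    """Build a fresh model and AdamW optimizer (hypothetical tiny setup)."""
    model = torch.nn.Linear(4, 2)
    return model, torch.optim.AdamW(model.parameters(), lr=0.1)


def step(model, optimizer):
    """One training step in the same order as the excerpt above."""
    optimizer.zero_grad(set_to_none=True)
    F.cross_entropy(model(x), y).backward()
    optimizer.step()


model, optimizer = make()
for _ in range(5):
    step(model, optimizer)

buffer = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)
buffer.seek(0)
ckpt = torch.load(buffer)

# Full resume: restore both weights and optimizer state.
resumed, resumed_opt = make()
resumed.load_state_dict(ckpt["model"])
resumed_opt.load_state_dict(ckpt["optimizer"])

# Weights-only resume: the optimizer starts from empty state.
weights_only, fresh_opt = make()
weights_only.load_state_dict(ckpt["model"])

for m, o in [(model, optimizer), (resumed, resumed_opt), (weights_only, fresh_opt)]:
    step(m, o)

full_matches = torch.allclose(model.weight, resumed.weight)
weights_only_matches = torch.allclose(model.weight, weights_only.weight)
print(full_matches, weights_only_matches)
```

In this setup the full resume tracks the original run on the next step, while the weights-only resume takes a different update because its moment estimates restart from zero.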

Interview Explanation

How to say it out loud

A training step runs forward to compute loss, backward to compute gradients, optimizer.step to update parameters, and zero_grad to clear gradients for the next step. CUDA memory changes because activations, gradients and optimizer states appear at different phases of the step.

External intuition notes

Additional intuition

Autograd docs help anchor the key lifecycle: forward records enough graph information for backward, and backward writes gradients to parameters. Official: PyTorch autograd tutorial
Optimizer state_dict docs are a useful reminder that optimizer state is per-parameter state, not part of the model weights themselves. Official: PyTorch Optimizer state_dict
Activation checkpointing explanations are useful here because they separate saved tensors for backward from persistent model or optimizer state. Blog: PyTorch activation checkpointing techniques
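
The distinction between tensors saved for backward and persistent state can be made visible with `torch.autograd.graph.saved_tensors_hooks`. This is a rough sketch under assumptions: the small `torch.nn.Sequential` model is hypothetical, and the byte count includes every tensor autograd chooses to save (inputs, intermediate activations, and possibly parameters), so it is an upper-level indication rather than an exact activation size.

```python
import torch

saved = {"count": 0, "bytes": 0}


def pack(tensor):
    """Called each time autograd saves a tensor for backward."""
    saved["count"] += 1
    saved["bytes"] += tensor.numel() * tensor.element_size()
    return tensor


def unpack(tensor):
    """Called when backward retrieves a saved tensor; returned unchanged here."""
    return tensor


model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
x = torch.randn(16, 32)

# Only tensors saved inside this context go through the hooks.
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()

loss.backward()  # consumes the saved tensors and fills parameter.grad
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(saved, param_bytes)
```

The saved tensors exist only between forward and backward, whereas the parameter bytes persist across steps; activation checkpointing trades the former for recomputation.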

InfraLens