InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 02: Single GPU Training Loop

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Mental Model

Core mechanism

  • The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
  • Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
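
The ledger idea can be made concrete with a little arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement: it assumes fp32 weights and gradients (4 bytes each per parameter) and an Adam-style optimizer that keeps two fp32 moment buffers per parameter (8 bytes). The function name `static_training_bytes` is hypothetical. Activations, temporary buffers, communication buckets, and KV cache are left out because they depend on batch size, sequence length, and architecture.

```python
def static_training_bytes(num_params, param_bytes=4, grad_bytes=4, adam_state_bytes=8):
    """Estimate the per-parameter ledger entries (assumed fp32 + Adam-style states)."""
    return {
        "weights": num_params * param_bytes,
        "gradients": num_params * grad_bytes,
        # Two moment buffers per parameter (first and second moment).
        "optimizer_states": num_params * adam_state_bytes,
    }


# Example: a hypothetical 1-billion-parameter model.
ledger = static_training_bytes(1_000_000_000)
total_gib = sum(ledger.values()) / 2**30

for name, size in ledger.items():
    print(f"{name:>16}: {size / 2**30:6.2f} GiB")
print(f"{'total':>16}: {total_gib:6.2f} GiB")
```

Under these assumptions the optimizer states alone are twice the size of the weights, which is why naming the ledger entry matters before choosing what to shard or offload.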

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-02-single-gpu-training/train_single_gpu.py. This preview explains the key idea; the linked starter file is the source of truth.

optimizer.zero_grad(set_to_none=True)

# Forward creates activations that autograd keeps for backward.
logits = model(x)
loss = F.cross_entropy(logits, y)

# Backward fills parameter.grad tensors.
loss.backward()

# AdamW reads gradients, updates parameters and maintains optimizer state.
optimizer.step()

torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "steps": args.steps,
}, args.checkpoint)

Line-by-line Explanation

Key code blocks

optimizer.zero_grad
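Clears stale gradients; with set_to_none=True the .grad tensors are dropped rather than filled with zeros. Without this call, each backward adds to the existing .grad values. That is the mechanism behind intentional gradient accumulation, but in a basic loop it silently mixes gradients from earlier steps into the update.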
model(x)
Runs the forward pass, builds the autograd graph, and saves the activation values that backward needs.
loss.backward
Runs autograd and populates parameter.grad; in DDP this is also where gradient hooks fire.
optimizer.step
Updates weights and, for AdamW, creates/updates optimizer states.
torch.save
Saves both model and optimizer state, because resuming training usually needs both.
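
The accumulation behavior described for optimizer.zero_grad can be imitated in plain Python. This is a toy sketch, not PyTorch: `ToyParam`, `backward`, and `zero_grad` are hypothetical stand-ins that mimic how a .grad slot is added to rather than overwritten, for a one-parameter model with loss (w * x - y) ** 2.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter that owns a .grad slot."""

    def __init__(self, value):
        self.value = value
        self.grad = None  # nothing stored yet, like after zero_grad(set_to_none=True)


def backward(param, x, y):
    # d/dw of (w * x - y) ** 2 is 2 * (w * x - y) * x
    g = 2.0 * (param.value * x - y) * x
    # Accumulate into the existing gradient instead of overwriting it.
    param.grad = g if param.grad is None else param.grad + g


def zero_grad(param):
    param.grad = None


w = ToyParam(1.0)

backward(w, x=2.0, y=3.0)
first = w.grad            # gradient of one backward call

backward(w, x=2.0, y=3.0)
stale = w.grad            # no zero_grad in between: two gradients summed

zero_grad(w)
backward(w, x=2.0, y=3.0)
fresh = w.grad            # cleared first, so only the new gradient remains

print(first, stale, fresh)  # -4.0 -8.0 -4.0
```

The doubled value in `stale` is exactly what a basic loop would feed to the optimizer if the clearing step were forgotten.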

What to Notice

How to read this code

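  • Memory allocated and memory reserved are not the same in PyTorch's CUDA caching allocator: allocated counts bytes held by live tensors (torch.cuda.memory_allocated), while reserved also includes cached blocks the allocator keeps for reuse (torch.cuda.memory_reserved).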
  • optimizer.step may allocate Adam state on its first call.
  • The loop is conceptually the same even when the model becomes a Transformer.

Common Misunderstandings

What this code does not mean

  • “zero_grad is optional.” It is optional only if you intentionally want gradient accumulation.
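  • “Checkpointing model weights is enough to resume training.” Optimizer state affects resumed training behavior: AdamW's moment estimates and step count live in the optimizer state, so a resume without them restarts those statistics from scratch.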
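
The second point can be checked with a toy experiment. The sketch below is a scalar, plain-Python imitation of an AdamW-style update, not PyTorch's implementation; `ToyAdamW` and `grad_fn` are hypothetical names, and the hyperparameters are arbitrary. It also creates its state lazily on the first step, mirroring the note that optimizer.step may allocate Adam state on its first call.

```python
import math


class ToyAdamW:
    """Scalar AdamW-style optimizer (illustrative only)."""

    def __init__(self, lr=0.1, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
        self.lr, self.betas, self.eps, self.weight_decay = lr, betas, eps, weight_decay
        self.state = {}  # stays empty until the first step

    def step(self, w, grad):
        if not self.state:  # lazy allocation on first call
            self.state = {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}
        b1, b2 = self.betas
        s = self.state
        s["step"] += 1
        s["exp_avg"] = b1 * s["exp_avg"] + (1 - b1) * grad
        s["exp_avg_sq"] = b2 * s["exp_avg_sq"] + (1 - b2) * grad * grad
        m_hat = s["exp_avg"] / (1 - b1 ** s["step"])      # bias correction
        v_hat = s["exp_avg_sq"] / (1 - b2 ** s["step"])
        w = w * (1 - self.lr * self.weight_decay)         # decoupled weight decay
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def grad_fn(w):
    return 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3) ** 2


opt = ToyAdamW()
state_before_first_step = dict(opt.state)  # {} because nothing is allocated yet

w = 0.0
for _ in range(5):
    w = opt.step(w, grad_fn(w))

checkpoint = {"model": w, "optimizer": dict(opt.state)}

# Resume A: restore weights and optimizer state.
full = ToyAdamW()
full.state = dict(checkpoint["optimizer"])
w_full = full.step(checkpoint["model"], grad_fn(checkpoint["model"]))

# Resume B: restore weights only; the optimizer starts from empty state.
weights_only = ToyAdamW()
w_weights_only = weights_only.step(checkpoint["model"], grad_fn(checkpoint["model"]))

# Reference: the uninterrupted run takes its sixth step.
w_reference = opt.step(w, grad_fn(w))

print("full resume matches reference:", w_full == w_reference)
print("weights-only resume matches reference:", w_weights_only == w_reference)
```

In this toy setup the full resume reproduces the uninterrupted run, while the weights-only resume takes a different next step because its moment estimates and step count start over.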

Interview Explanation

How to say it out loud

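A training step runs forward to compute loss, backward to compute gradients, optimizer.step to update parameters, and zero_grad to clear gradients for the next step. CUDA memory usage changes across the step because each ledger entry appears at a different phase: activations are saved during forward and released as backward consumes them, gradients are filled in during backward, and AdamW's optimizer states are allocated on the first optimizer.step.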
