# Lab 02: Single GPU Training Loop

This starter is a code-reading walkthrough of a minimal PyTorch training step.
It can run on CUDA or CPU, but running it is optional. The main goal is to see
where activations, gradients, optimizer state and checkpoints appear in code.

## Reading focus

- `model(x)` creates logits and saves activations needed by autograd.
- `loss.backward()` fills `parameter.grad`.
- `optimizer.step()` updates parameters and creates/updates AdamW state.
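- `optimizer.zero_grad(set_to_none=True)` resets each `parameter.grad` to `None`, so the next `backward()` does not add to stale gradients.
- `torch.cuda.memory_allocated()` reports bytes occupied by live tensors, while `torch.cuda.memory_reserved()` reports bytes held by the caching allocator, including cached blocks that are not currently in use.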
- `torch.save(...)` shows why resume checkpoints include optimizer state, not just weights.
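
The calls listed above can be seen together in a minimal sketch of one training step. This is not the lab's `train_single_gpu.py`: the toy model, the random batch, the `report_memory` helper and the in-memory checkpoint buffer are all assumptions made for illustration.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy model and batch; the lab script defines its own.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)


def report_memory(tag):
    # Hypothetical helper. Only meaningful on CUDA, where the caching
    # allocator typically reserves at least as much as is allocated.
    if device == "cuda":
        print(tag, torch.cuda.memory_allocated(), torch.cuda.memory_reserved())


# Before any step: no gradients and no optimizer state exist yet.
grads_before = [p.grad for p in model.parameters()]
state_before = len(optimizer.state)

logits = model(x)  # forward pass: activations are saved for autograd
loss = nn.functional.cross_entropy(logits, y)
report_memory("after forward")

loss.backward()  # fills parameter.grad
grads_after_backward = [p.grad is not None for p in model.parameters()]
report_memory("after backward")

optimizer.step()  # updates parameters and creates AdamW state on first use
state_after = len(optimizer.state)
first_param = next(iter(model.parameters()))
state_keys = sorted(optimizer.state[first_param].keys())
report_memory("after step")

optimizer.zero_grad(set_to_none=True)  # gradients are released, not zero-filled
grads_after_zero = [p.grad for p in model.parameters()]

# A resume checkpoint stores optimizer state next to the weights.
buffer = io.BytesIO()
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    buffer,
)
buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")

print("optimizer state entries before/after step:", state_before, state_after)
print("per-parameter AdamW state keys:", state_keys)
print("checkpoint keys:", sorted(checkpoint.keys()))
```

Writing to an `io.BytesIO` buffer keeps the sketch free of file-system side effects; a real run would normally pass a file path to `torch.save`.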

## Optional command

If you later want to validate the lifecycle by running the script, first with its defaults and then with explicit settings:

```bash
python3 train_single_gpu.py
python3 train_single_gpu.py --steps 50 --batch-size 64 --hidden-dim 128
```

## Questions to answer while reading

- Which line creates activations?
- Which line creates gradients?
- Why can Adam state appear only after the first optimizer step?
- Why is `zero_grad` a semantic choice rather than boilerplate?
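
One way to probe the `zero_grad` question empirically is a small experiment like the one below. It is a sketch under assumptions: the one-layer model, the random data and the `backward_once` helper are hypothetical and do not come from the lab script.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical one-layer model and fixed batch, chosen only for this experiment.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)


def backward_once():
    # Recompute the same loss and backpropagate without touching the optimizer.
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()


backward_once()
first = model.weight.grad.clone()

backward_once()  # no zero_grad in between: backward() adds to the existing .grad
accumulated = model.weight.grad.clone()

optimizer.zero_grad(set_to_none=True)
backward_once()  # starts again from an empty gradient
fresh = model.weight.grad.clone()

print("second backward doubled the gradient:", torch.allclose(accumulated, 2 * first))
print("after zero_grad the gradient matches the first:", torch.allclose(fresh, first))
```

Skipping `zero_grad` on purpose is how gradient accumulation over several micro-batches is usually implemented, which is why where the call sits in the loop changes what the optimizer step means.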
