# Lab 03: DDP Conversion

This starter is an annotated reading example for PyTorch DistributedDataParallel
(DDP). It explains the structure of a DDP program, and you can follow it even if
you do not have multiple GPUs.

## Reading focus

- `torchrun` launches one Python process per worker and sets environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for each.
- `LOCAL_RANK` maps a process to a local GPU.
- `init_process_group(...)` joins a communication group.
- `DistributedDataParallel(...)` wraps the module and registers gradient hooks.
- `loss.backward()` is where DDP gradient buckets are all-reduced.
- `rank == 0` avoids multiple processes writing the same checkpoint.
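
The first two and the last bullet can be sketched without PyTorch at all. The snippet below is a minimal illustration, not the code in `train_ddp.py`: the names `DistEnv`, `read_dist_env`, and `should_write_checkpoint` are hypothetical helpers introduced here. The environment variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) are the ones `torchrun` exports. The single-process fallbacks are an assumption about how a script might behave when started with plain `python3`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the values torchrun exports."""
    rank: int        # global process index across all nodes
    local_rank: int  # process index on this node, typically used as the GPU index
    world_size: int  # total number of processes in the job


def read_dist_env(environ=None) -> DistEnv:
    """Read the rank variables, falling back to a single-process setup."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def should_write_checkpoint(dist_env: DistEnv) -> bool:
    # Only global rank 0 writes, so processes never race on the same file.
    return dist_env.rank == 0


if __name__ == "__main__":
    info = read_dist_env()
    print(info, "writes checkpoint:", should_write_checkpoint(info))
```

Run directly, this reports rank 0 of a world of size 1. Under `torchrun`, each process would see its own values, and only one of them would pass the checkpoint guard.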

## Optional commands

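If you later want to try the script, the first command runs it as a single
process and the second uses `torchrun` to launch two processes on one machine: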

```bash
python3 train_ddp.py --steps 5
torchrun --standalone --nproc-per-node=2 train_ddp.py --steps 20
```

## Questions to answer while reading

- Which variables come from `torchrun` rather than the script?
- Why does DDP not reduce per-GPU model memory?
- Why is gradient synchronization tied to `loss.backward()` rather than `optimizer.step()`?
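
As an aid for thinking about the last question, the toy simulation below mimics what an averaging all-reduce does to gradients. It is a plain-Python sketch under simplifying assumptions: two simulated ranks, lists of floats in place of tensors, and hypothetical helpers `all_reduce_mean` and `sgd_step` that are not part of PyTorch or of `train_ddp.py`.

```python
def all_reduce_mean(per_rank_grads):
    """Simulate an averaging all-reduce: every rank receives the element-wise mean."""
    world_size = len(per_rank_grads)
    mean = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(mean) for _ in range(world_size)]


def sgd_step(weights, grads, lr):
    """Plain SGD update applied locally on one rank."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two simulated ranks start from identical weights.
weights = [[1.0, 2.0], [1.0, 2.0]]
# Each rank sees a different data shard, so its local gradients differ.
local_grads = [[0.5, 1.0], [1.5, 3.0]]

# After the all-reduce, both ranks hold the same averaged gradients.
synced = all_reduce_mean(local_grads)
# Each rank then applies its own optimizer step with no further communication.
weights = [sgd_step(w, g, lr=0.5) for w, g in zip(weights, synced)]

print(synced)   # [[1.0, 2.0], [1.0, 2.0]]
print(weights)  # [[0.5, 1.0], [0.5, 1.0]]
```

Note which step needs communication between ranks and which one is purely local, then compare that with where the real script calls `loss.backward()` and `optimizer.step()`.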