# Lab 08: ZeRO / FSDP Memory Sharding

This starter is a reading aid for the memory formulas behind training-state
sharding. It mirrors the memory calculator idea in CLI form, but you can follow
the concept just by reading the equations and comments; running it is optional.

## Reading focus

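- `replicated` is the DDP-style baseline: every rank holds a full copy of the parameters, gradients, and optimizer states.
- `zero1` shards optimizer states across data-parallel ranks; parameters and gradients stay replicated.
- `zero2` shards optimizer states and gradients; parameters stay replicated.
- `zero3` shards parameters, gradients, and optimizer states, so no training state is fully replicated.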
- The communication notes explain why all-gather and reduce-scatter appear in FSDP/ZeRO-style systems.
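
The stage definitions above can be turned into a per-rank memory estimate. The following is a minimal sketch, not the code in `zero_memory_accounting.py`: the function name `per_rank_gb`, the `SHARDED` table, and the byte counts are assumptions chosen for illustration. It assumes mixed-precision Adam, where each parameter costs 2 bytes for half-precision weights, 2 bytes for half-precision gradients, and 12 bytes of optimizer state (fp32 master weights, momentum, and variance).

```python
# Bytes per parameter, assuming mixed-precision Adam. These constants are an
# assumption for this sketch and may differ from the lab script.
PARAM_BYTES = 2   # fp16/bf16 working weights
GRAD_BYTES = 2    # fp16/bf16 gradients
OPTIM_BYTES = 12  # fp32 master weights + Adam momentum + Adam variance

# Which training states each mode shards across data-parallel ranks.
SHARDED = {
    "replicated": set(),
    "zero1": {"optim"},
    "zero2": {"optim", "grads"},
    "zero3": {"optim", "grads", "params"},
}


def per_rank_gb(num_params, dp_degree, mode):
    """Estimate per-rank training-state memory in GB (1 GB = 1e9 bytes)."""
    sharded = SHARDED[mode]
    states = (("params", PARAM_BYTES), ("grads", GRAD_BYTES), ("optim", OPTIM_BYTES))
    total_bytes = 0.0
    for name, bytes_per_param in states:
        state_bytes = num_params * bytes_per_param
        if name in sharded:
            # A sharded state is split evenly, so each rank holds 1/dp_degree.
            state_bytes /= dp_degree
        total_bytes += state_bytes
    return total_bytes / 1e9


if __name__ == "__main__":
    for mode in SHARDED:
        print(f"{mode:>10}: {per_rank_gb(7_000_000_000, 8, mode):.2f} GB")
```

Under these assumptions, a 7B-parameter model at data-parallel degree 8 needs about 112 GB per rank when replicated, 38.5 GB with `zero1`, 26.25 GB with `zero2`, and 14 GB with `zero3`. Activations, temporary buffers, and fragmentation are deliberately left out.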

## Optional command

If you later want numeric estimates:

```bash
python3 zero_memory_accounting.py
python3 zero_memory_accounting.py --params-b 7 --dp-degree 8
```

## Questions to answer while reading

- Which state does each ZeRO stage shard?
- Why does sharding save memory but add communication?
- What memory terms are not captured by the simple formula?
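
As a starting point for the second question, per-step communication can be estimated in the same spirit. This is a rough sketch under stated assumptions, not part of the lab script: the name `comm_volume_params` is hypothetical, and the model assumes ring-style collectives in which one all-gather or one reduce-scatter moves about `(N - 1) / N` of the parameter count per rank. An all-reduce is counted as a reduce-scatter followed by an all-gather.

```python
# Number of all-gather / reduce-scatter passes over the full parameter count
# per training step, under the assumptions stated above.
COLLECTIVE_PASSES = {
    "replicated": 2,  # all-reduce of gradients = reduce-scatter + all-gather
    "zero1": 2,       # reduce-scatter gradients, all-gather updated parameters
    "zero2": 2,       # same pattern as zero1
    "zero3": 3,       # all-gather params (forward), again (backward), reduce-scatter grads
}


def comm_volume_params(num_params, dp_degree, mode):
    """Approximate elements sent per rank per step, in units of parameters."""
    # Each pass moves roughly (N - 1) / N of the full tensor per rank.
    ring_factor = (dp_degree - 1) / dp_degree
    return COLLECTIVE_PASSES[mode] * ring_factor * num_params


if __name__ == "__main__":
    for mode in COLLECTIVE_PASSES:
        volume = comm_volume_params(7_000_000_000, 8, mode)
        print(f"{mode:>10}: {volume / 1e9:.2f} B elements per rank per step")
```

In this simplified model, `zero1` and `zero2` cost about the same communication as the replicated baseline, while `zero3` costs roughly 1.5 times as much because parameters must be re-gathered before they are used. Real systems overlap communication with compute and may prefetch or keep some shards resident, so treat the numbers as an upper-level intuition rather than a measurement.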
