# Lab 01: Transformer Memory Accounting

This starter is annotated reading material for a rough Transformer memory ledger.
You do not need to run it to understand the lab; read the formulas and comments
as a map from model configuration to memory pressure.

## Reading focus

- `attention_per_layer = 4 * D * D` counts the Q, K, V, and output projection matrices, each roughly `D * D`, where `D` is the hidden dimension.
- `ffn_per_layer = m * D * D` compresses FFN/SwiGLU variants into a rough multiplier `m`.
- `embedding = vocab_size * D` counts the token embedding table.
- Parameter, gradient, and Adam memory explain why training needs more than the weights alone: gradients add one value per parameter, and Adam adds two more (first and second moment estimates).
- KV cache memory explains why inference serving is constrained by batch size, sequence length, number of layers, and KV head shape.
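
The formulas in the list above can be sketched as a small Python ledger. This is a minimal sketch under stated assumptions, not the contents of `memory_accounting.py`: the function names are hypothetical, the default `ffn_multiplier=8` assumes a classic two-matrix FFN with 4x inner width, and every value is assumed to be stored in fp32 (4 bytes). Biases, normalization weights, and activations are ignored.

```python
def count_parameters(hidden_dim, num_layers, vocab_size, ffn_multiplier=8):
    """Rough parameter count from the reading-focus formulas.

    ffn_multiplier is the `m` in `m * D * D`; 8 is an assumed default.
    """
    attention_per_layer = 4 * hidden_dim * hidden_dim       # Q, K, V, output
    ffn_per_layer = ffn_multiplier * hidden_dim * hidden_dim
    embedding = vocab_size * hidden_dim                      # token table
    return num_layers * (attention_per_layer + ffn_per_layer) + embedding


def training_state_bytes(num_params, optimizer="adam", bytes_per_value=4):
    """Weights + gradients + optimizer state, assuming one dtype throughout."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    if optimizer == "adam":
        # Adam keeps two extra values per parameter (first and second moments).
        optimizer_state = 2 * num_params * bytes_per_value
    elif optimizer == "sgd":
        # Plain SGD without momentum keeps no per-parameter state.
        optimizer_state = 0
    else:
        raise ValueError(f"unsupported optimizer: {optimizer}")
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_state": optimizer_state,
        "total": weights + gradients + optimizer_state,
    }


if __name__ == "__main__":
    params = count_parameters(hidden_dim=4096, num_layers=32, vocab_size=32000)
    for name in ("sgd", "adam"):
        total = training_state_bytes(params, optimizer=name)["total"]
        print(f"{name}: {params:,} params, {total / 1e9:.1f} GB of training state")
```

Under these assumptions, a model with `D = 4096`, 32 layers, and a 32000-token vocabulary has about 6.57 billion parameters, and fp32 Adam training state comes to roughly 105 GB before any activations are counted, twice what plain SGD would need.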

## Optional command

If you later want to check the numbers yourself, run the script with its defaults or with explicit settings:

```bash
python3 memory_accounting.py
python3 memory_accounting.py --hidden-dim 4096 --num-layers 32 --num-heads 32 --vocab-size 32000 --seq-len 2048 --batch-size 4
```

## Questions to answer while reading

- Which formula counts parameters, and which formula counts runtime memory?
- Why does Adam use more memory than SGD-style updates?
- Why does inference drop gradients and optimizer states but add KV cache pressure?
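
As a starting point for the last question, the KV cache side can be estimated with a sketch like the one below. It assumes the common layout of one key tensor and one value tensor per layer, each shaped `[batch, seq_len, num_kv_heads, head_dim]`, stored in fp16 (2 bytes per value). The function name and parameters are hypothetical and may not match `memory_accounting.py`.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Rough KV cache size for a full context, assuming fp16 by default."""
    # Values in one K (or V) tensor for one layer.
    per_tensor = batch_size * seq_len * num_kv_heads * head_dim
    # Two tensors (K and V) per layer, across all layers.
    return 2 * num_layers * per_tensor * bytes_per_value


if __name__ == "__main__":
    gib = 1024 ** 3
    # Assumes head_dim = hidden_dim / num_heads = 4096 / 32 = 128.
    full = kv_cache_bytes(batch_size=4, seq_len=2048, num_layers=32,
                          num_kv_heads=32, head_dim=128)
    # Same model shape but with 8 shared KV heads (grouped-query attention).
    grouped = kv_cache_bytes(batch_size=4, seq_len=2048, num_layers=32,
                             num_kv_heads=8, head_dim=128)
    print(f"32 KV heads: {full / gib:.1f} GiB, 8 KV heads: {grouped / gib:.1f} GiB")
```

Unlike parameter memory, this term grows linearly with batch size and sequence length, so it scales with serving load rather than with model size alone. With the assumed shapes, 32 KV heads give 4 GiB and 8 KV heads give 1 GiB.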