Lab 01: Transformer Memory Accounting
Annotated code reading lab. Running code is optional.
Foundation / Calculators
Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Read code to understand the concept
A Transformer embeds token ids as vectors. Each block then mixes context with attention, applies a per-token nonlinear transformation (the FFN), and uses residual connections and normalization layers to keep deep training stable.
Core mechanism
- The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
- Inference analysis separates prefill from decode and tracks batching, KV Cache capacity, and tail latency. A useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-01-transformer-memory/memory_accounting.py. This preview explains the key idea; the linked starter file is the source of truth.
attention_per_layer = 4 * d * d # Q, K, V and output projection
ffn_per_layer = args.ffn_multiplier * d * d
embedding = args.vocab_size * d
total_params = layers * (attention_per_layer + ffn_per_layer) + embedding
param_memory = total_params * args.precision_bytes
# Gradients are another tensor of roughly the same size as parameters.
grad_memory = total_params * args.precision_bytes
# Adam keeps moving-average states; this is why Adam uses much more memory than SGD.
adam_memory = total_params * args.adam_bytes
kv_cache_bytes = (
    args.batch_size * args.seq_len * layers * heads * head_dim * 2 * args.precision_bytes
)
Key code blocks
- attention_per_layer: Encodes the self-attention projection count. Multi-head attention changes the shape decomposition, not the total D x D projection size.
- ffn_per_layer: A compact approximation for FFN/SwiGLU variants. It is not exact, but it gives the right order of magnitude.
- param_memory / grad_memory / adam_memory: Maps parameter count to training-state memory. This is the bridge to ZeRO/FSDP.
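- kv_cache_bytes: Maps inference serving pressure to batch size, sequence length, layers, KV heads, head_dim, and dtype. With grouped-query or multi-query attention, the heads factor should be the number of KV heads, not the number of query heads.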
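The excerpt depends on variables defined elsewhere in the starter file. The following is a minimal self-contained sketch of the same ledger, not the starter's actual code: the `ModelConfig` fields, the `kv_heads` parameter, and the example values are all hypothetical, and tied input/output embeddings are assumed.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical field names and values; the starter file reads these from args.
    d_model: int
    layers: int
    heads: int
    kv_heads: int          # equals heads for plain multi-head attention
    vocab_size: int
    ffn_multiplier: float  # FFN params per layer ~= ffn_multiplier * d_model**2


def count_params(cfg: ModelConfig) -> int:
    """Approximate parameter count, ignoring biases and norm weights."""
    # Keeps the excerpt's 4 * d * d approximation for Q, K, V and output projection.
    attention = 4 * cfg.d_model * cfg.d_model
    ffn = int(cfg.ffn_multiplier * cfg.d_model * cfg.d_model)
    embedding = cfg.vocab_size * cfg.d_model  # assumption: tied embeddings
    return cfg.layers * (attention + ffn) + embedding


def kv_cache_bytes(cfg: ModelConfig, batch_size: int, seq_len: int,
                   precision_bytes: int = 2) -> int:
    """Bytes needed to cache K and V for every layer and token."""
    head_dim = cfg.d_model // cfg.heads
    # Factor 2: one K tensor and one V tensor per layer.
    return (batch_size * seq_len * cfg.layers
            * cfg.kv_heads * head_dim * 2 * precision_bytes)


cfg = ModelConfig(d_model=4096, layers=32, heads=32, kv_heads=32,
                  vocab_size=32000, ffn_multiplier=8)
total = count_params(cfg)
print(f"params: {total / 1e9:.2f}B")
print(f"weights at 2 bytes/param: {total * 2 / 1e9:.1f} GB")
print(f"KV cache, batch 1 x 4096 tokens: {kv_cache_bytes(cfg, 1, 4096) / 2**30:.2f} GiB")
```

Setting `kv_heads` lower than `heads` in this sketch shrinks the KV Cache proportionally, which is the effect grouped-query attention has on serving memory.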
How to read this code
- Do not multiply attention parameters by the number of heads. The 4 * d * d term already covers all heads, because each head projects to d / heads dimensions.
- Training memory and inference memory have different dominant terms.
- The script excludes activation memory and allocator fragmentation unless you add your own estimates.
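The point about heads can be checked by counting the attention weights one head at a time. This is an illustrative sketch that ignores biases; `attention_params` is a hypothetical helper, not a function in the starter file.

```python
def attention_params(d_model: int, heads: int) -> int:
    """Count Q, K, V and output projection weights by summing per-head slices."""
    if d_model % heads != 0:
        raise ValueError("d_model must be divisible by heads")
    head_dim = d_model // heads
    # Each head owns a d_model x head_dim slice of each of Q, K and V.
    per_head_qkv = 3 * d_model * head_dim
    # The output projection maps the concatenated heads (heads * head_dim = d_model)
    # back to d_model, so it is a single d_model x d_model matrix.
    output_proj = d_model * d_model
    return heads * per_head_qkv + output_proj


counts = {h: attention_params(4096, h) for h in (8, 16, 32, 64)}
for h, n in counts.items():
    print(f"heads={h:2d}  attention params per layer={n}")
```

Every head count gives the same total of 4 * d_model * d_model, because more heads means proportionally narrower heads.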
What this code does not mean
- "7B in FP16 means training needs only 14GB." Training also needs gradients, optimizer states, activations, and buffers.
- "KV Cache is a training concern." KV Cache is mainly an inference serving concern.
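The first misconception can be checked with arithmetic. The sketch below assumes mixed-precision Adam: FP16 weights and gradients at 2 bytes each, plus 12 bytes per parameter of FP32 optimizer-side state (master weights, momentum, variance). The function name and defaults are illustrative and may not match the starter's `adam_bytes` setting. Activations and temporary buffers are left out, so the result is still a lower bound.

```python
GB = 1e9  # decimal gigabytes, matching the "14GB" shorthand


def training_state_gb(params: int, weight_bytes: int = 2, grad_bytes: int = 2,
                      optimizer_bytes: int = 12) -> dict:
    """Per-bucket training-state memory in GB, excluding activations and buffers."""
    ledger = {
        "weights": params * weight_bytes / GB,
        "gradients": params * grad_bytes / GB,
        # Assumption: FP32 master weights + momentum + variance = 4 + 4 + 4 bytes.
        "optimizer": params * optimizer_bytes / GB,
    }
    ledger["total"] = sum(ledger.values())
    return ledger


ledger = training_state_gb(7_000_000_000)
for name, gb in ledger.items():
    print(f"{name:>10}: {gb:6.1f} GB")
```

Under these assumptions the 14 GB of weights is one eighth of the training state, before any activation memory is counted.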
How to say it out loud
A good explanation starts from 4D^2 attention parameters and mD^2 FFN parameters per layer (D is the model width, m the FFN multiplier), then says weights are only the lower bound on memory. Training adds gradients and Adam states, while inference drops those but adds KV Cache. This is why ZeRO/FSDP, checkpointing, and PagedAttention solve different parts of the memory ledger.
Additional intuition
- PyTorch memory notes are useful for remembering that allocator behavior and snapshots are separate from the simple parameter ledger. Official: PyTorch CUDA memory management
- The PyTorch activation checkpointing blog gives a clean intuition: activation memory can be traded for recomputation, so it belongs in a different bucket from weights and optimizer states. Blog: PyTorch activation checkpointing techniques
- DeepSpeed ZeRO material is a good follow-up after the ledger: it starts from replicated training state and asks which parts can be partitioned. Official: DeepSpeed ZeRO tutorial
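The partitioning question in the last note can be sketched on top of the ledger. The following is a rough model of the ZeRO stages, not DeepSpeed's implementation: it assumes 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state, and it ignores activations, buffers, and communication buckets. `zero_per_gpu_bytes` is a hypothetical helper.

```python
def zero_per_gpu_bytes(params: int, num_gpus: int, stage: int,
                       weight_bytes: int = 2, grad_bytes: int = 2,
                       optimizer_bytes: int = 12) -> float:
    """Approximate per-GPU training-state bytes under ZeRO stage 0 to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights = params * weight_bytes
    grads = params * grad_bytes
    optimizer = params * optimizer_bytes
    if stage >= 1:
        optimizer /= num_gpus  # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3 also partitions parameters
    return weights + grads + optimizer


per_gpu = {s: zero_per_gpu_bytes(7_500_000_000, 64, s) / 1e9 for s in range(4)}
for stage, gb in per_gpu.items():
    print(f"ZeRO stage {stage}: {gb:7.2f} GB per GPU")
```

Stage 0 is plain data parallelism with fully replicated state. Each later stage removes one replicated bucket, which is why the ledger's separation of weights, gradients, and optimizer states matters.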
