Lab 01 - Transformer Memory Accounting

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Foundation / Calculators

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Concept Goal

Read code to understand the concept

A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.

Mental Model

Core mechanism

The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
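One way to make the ledger concrete is a small record with one field per bucket. The sketch below is illustrative only: the `MemoryLedger` class, its method names, and the byte values are assumptions made for this example and are not part of the starter file.

```python
from dataclasses import dataclass, fields


@dataclass
class MemoryLedger:
    """Hypothetical per-GPU memory ledger; every field is a size in bytes."""
    weights: int = 0
    gradients: int = 0
    optimizer_states: int = 0
    activations: int = 0
    temporary_buffers: int = 0
    communication_buckets: int = 0
    kv_cache: int = 0

    def total(self) -> int:
        """Sum of all buckets."""
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant(self) -> str:
        """Name of the largest bucket, i.e. the bottleneck to name first."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


GIB = 1024 ** 3
# Made-up numbers for illustration, not measurements.
training = MemoryLedger(weights=14 * GIB, gradients=14 * GIB,
                        optimizer_states=84 * GIB, activations=20 * GIB)
serving = MemoryLedger(weights=14 * GIB, kv_cache=30 * GIB)
print(training.dominant(), serving.dominant())
```

Keeping the buckets separate is the point of the design: the same weights appear in both ledgers, but the largest entry differs between training and serving.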

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README
memory_accounting.py
Optional note template

Annotated Code Preview

Starter Preview

Excerpt from code/lab-01-transformer-memory/memory_accounting.py. This preview explains the key idea; the linked starter file is the source of truth.

attention_per_layer = 4 * d * d        # Q, K, V and output projection
ffn_per_layer = args.ffn_multiplier * d * d
embedding = args.vocab_size * d

total_params = layers * (attention_per_layer + ffn_per_layer) + embedding

param_memory = total_params * args.precision_bytes
# Gradients are another tensor of roughly the same size as parameters.
grad_memory = total_params * args.precision_bytes
# Adam keeps moving-average states; this is why Adam uses much more memory than SGD.
adam_memory = total_params * args.adam_bytes

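# Factor 2: one key tensor and one value tensor per layer.
# With grouped-query attention, `heads` should be the number of KV heads.
kv_cache_bytes = (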
    args.batch_size * args.seq_len * layers * heads * head_dim * 2 * args.precision_bytes
)
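The excerpt relies on names defined elsewhere in the starter file (`args`, `d`, `layers`, `heads`, `head_dim`). A self-contained sketch of the same arithmetic might look like the following. The `Args` dataclass, the `account` function, and all default values are assumptions chosen for illustration. In particular, `adam_bytes = 8` assumes two FP32 moment tensors per parameter, which may not match the starter file's default.

```python
from dataclasses import dataclass


@dataclass
class Args:
    """Hypothetical stand-in for the starter file's parsed arguments."""
    d_model: int = 4096
    layers: int = 32
    heads: int = 32
    vocab_size: int = 32000
    ffn_multiplier: int = 8   # assumes two d x 4d FFN matrices
    precision_bytes: int = 2  # FP16 or BF16
    adam_bytes: int = 8       # assumes two FP32 moments per parameter
    batch_size: int = 1
    seq_len: int = 4096


def account(args: Args) -> dict:
    """Repeat the excerpt's arithmetic and return every term by name."""
    d = args.d_model
    head_dim = d // args.heads
    attention_per_layer = 4 * d * d              # Q, K, V and output projection
    ffn_per_layer = args.ffn_multiplier * d * d
    embedding = args.vocab_size * d
    total_params = args.layers * (attention_per_layer + ffn_per_layer) + embedding
    return {
        "total_params": total_params,
        "param_memory": total_params * args.precision_bytes,
        "grad_memory": total_params * args.precision_bytes,
        "adam_memory": total_params * args.adam_bytes,
        "kv_cache_bytes": (args.batch_size * args.seq_len * args.layers
                           * args.heads * head_dim * 2 * args.precision_bytes),
    }


result = account(Args())
for name, value in result.items():
    print(f"{name:>15}: {value:,}")
```

With these assumed defaults the parameter count lands near 6.6 billion, which is the scale of a typical "7B" model. Biases, normalization weights, and a separate output embedding are ignored, as in the excerpt.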

Line-by-line Explanation

Key code blocks

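attention_per_layer: Counts the four self-attention projections (Q, K, V, and output), each d x d, ignoring biases. Multi-head attention splits these matrices across heads; it changes the shape decomposition, not the total 4 * d * d size.
ffn_per_layer: A compact approximation for FFN/SwiGLU variants. For example, a classic two-matrix FFN with hidden width 4d has 8 * d * d weights, which corresponds to ffn_multiplier = 8. It is not exact, but it gives the right order of magnitude.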
param_memory / grad_memory / adam_memory: Maps parameter count to training-state memory. This is the bridge to ZeRO/FSDP.
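kv_cache_bytes: Maps inference serving pressure to batch size, sequence length, layers, KV heads, head_dim, and dtype. The factor 2 covers keys and values. With grouped-query or multi-query attention, the head count should be the number of KV heads, not the number of query heads.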
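To see how much the KV head count matters, the KV Cache term can be pulled into a small function and evaluated for two head layouts. This is a sketch under assumed shapes. The function signature and the example configuration (32 layers, head_dim 128, batch 8, sequence length 4096) are hypothetical and not taken from the starter file.

```python
def kv_cache_bytes(batch_size, seq_len, layers, kv_heads, head_dim, precision_bytes=2):
    """KV Cache size in bytes; the factor 2 is one key plus one value tensor."""
    return batch_size * seq_len * layers * kv_heads * head_dim * 2 * precision_bytes


GIB = 1024 ** 3

# Multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(batch_size=8, seq_len=4096, layers=32, kv_heads=32, head_dim=128)
# Grouped-query attention: 32 query heads share 8 K/V heads (assumed layout).
gqa = kv_cache_bytes(batch_size=8, seq_len=4096, layers=32, kv_heads=8, head_dim=128)

print(f"MHA: {mha / GIB:.1f} GiB, GQA: {gqa / GIB:.1f} GiB")
```

The cache grows linearly in batch size and sequence length, so doubling either doubles the bytes, while cutting KV heads from 32 to 8 divides it by four without changing the parameter-count formula for the query and output projections.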

What to Notice

How to read this code

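Do not multiply attention parameters by the number of heads: 4 * d * d already covers all heads, because each head owns only a head_dim-wide slice of each projection.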
Training memory and inference memory have different dominant terms.
The script excludes activation memory and allocator fragmentation unless you add your own estimates.
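If you do want a first activation estimate, one widely quoted approximation (from the Megatron-LM work on activation recomputation) puts per-layer activation memory at roughly s * b * h * (34 + 5 * a * s / h) bytes, where s is sequence length, b is micro-batch size, h is hidden size, and a is the number of heads. The sketch below applies that formula. It assumes 16-bit activations, no recomputation, no tensor or sequence parallelism, and an attention implementation that materializes the s x s score matrix. The function name and example shapes are hypothetical.

```python
def activation_bytes_per_layer(seq_len, batch_size, d_model, heads):
    """Rough per-layer activation bytes: s*b*h*(34 + 5*a*s/h), kept in integers."""
    # Terms that scale with s * b * h (projections, FFN, norms, dropout masks).
    linear_terms = 34 * seq_len * batch_size * d_model
    # Terms that scale with a * s^2 * b (attention scores, softmax, dropout).
    attention_scores = 5 * heads * seq_len * seq_len * batch_size
    return linear_terms + attention_scores


GIB = 1024 ** 3
per_layer = activation_bytes_per_layer(seq_len=2048, batch_size=1, d_model=4096, heads=32)
total = 32 * per_layer  # assumed 32 layers
print(f"{per_layer / GIB:.2f} GiB per layer, {total / GIB:.1f} GiB for 32 layers")
```

Unlike weights and optimizer states, this term depends on batch size and sequence length, and the second component grows quadratically in sequence length. Fused attention kernels that avoid storing the score matrix would shrink or remove that component, so treat the result as an upper-end estimate.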

Common Misunderstandings

What this code does not mean

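“7B in FP16 means training needs only 14 GB.” That figure covers the weights alone (7 billion parameters x 2 bytes). Training also needs gradients, optimizer states, activations, and buffers.
“KV Cache is a training concern.” KV Cache is mainly an inference serving concern: it stores the keys and values of earlier tokens so that decode does not recompute them.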
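The first misunderstanding is easy to check with arithmetic. The sketch below assumes a common mixed-precision Adam recipe: FP16 weights and gradients plus FP32 master weights, momentum, and variance, for 16 bytes per parameter. That recipe is an assumption for illustration. The starter script's `precision_bytes` and `adam_bytes` settings may account for the same state differently.

```python
params = 7_000_000_000
GB = 10 ** 9

# Inference-style lower bound: FP16 weights only.
weights_fp16 = params * 2

# Assumed mixed-precision Adam training state, in bytes per parameter.
bytes_per_param = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}
training_state = params * sum(bytes_per_param.values())

print(f"weights only:   {weights_fp16 / GB:.0f} GB")
print(f"training state: {training_state / GB:.0f} GB (before activations and buffers)")
```

Under these assumptions the persistent training state is eight times the weight-only figure, and activations, temporary buffers, and fragmentation still come on top.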

Interview Explanation

How to say it out loud

A good explanation starts from 4D^2 attention and mD^2 FFN, then says weights are only the lower bound. Training adds gradients and Adam states, while inference drops those but adds KV Cache. This is why ZeRO/FSDP, checkpointing and PagedAttention solve different parts of the memory ledger.

External intuition notes

Additional intuition

PyTorch memory notes are useful for remembering that allocator behavior and snapshots are separate from the simple parameter ledger. Official: PyTorch CUDA memory management
The PyTorch activation checkpointing blog gives a clean intuition: activation memory can be traded for recomputation, so it belongs in a different bucket from weights and optimizer states. Blog: PyTorch activation checkpointing techniques
DeepSpeed ZeRO material is a good follow-up after the ledger: it starts from replicated training state and asks which parts can be partitioned. Official: DeepSpeed ZeRO tutorial
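The partitioning question raised by ZeRO can be sketched as per-GPU arithmetic over the replicated training state. The function below is a back-of-the-envelope model, not DeepSpeed's API. Its name, parameters, and the 2 + 2 + 12 bytes-per-parameter split (FP16 weights, FP16 gradients, FP32 optimizer state) are assumptions, and it ignores activations, communication buffers, and fragmentation.

```python
def zero_bytes_per_gpu(params, num_gpus, stage,
                       weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Model-state bytes per GPU when ZeRO stage 0-3 partitions across num_gpus."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = params * weight_bytes
    grads = params * grad_bytes
    optim = params * optim_bytes
    if stage >= 1:   # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:   # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:   # stage 3 also partitions the weights themselves
        weights /= num_gpus
    return weights + grads + optim


GB = 10 ** 9
per_stage = {stage: zero_bytes_per_gpu(7_500_000_000, 64, stage) for stage in range(4)}
for stage, size in per_stage.items():
    print(f"stage {stage}: {size / GB:.1f} GB per GPU")
```

Each stage removes one replicated bucket from the ledger, which is why the optimizer-state term is usually the first one worth partitioning: under these assumptions it is the largest of the three.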

InfraLens