InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 01: Transformer Memory Accounting

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.

Mental Model

Core mechanism

  • A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.
  • The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
  • Inference systems separate prefill, decode, batching, KV Cache capacity, and tail latency. The useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
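As a rough illustration of the memory-bandwidth link in the last bullet, the sketch below bounds one decode step by the time needed to stream the weights and the KV Cache from memory once. The function name, the one-read-per-step model and the 1 TB/s bandwidth figure are assumptions made for illustration, not values from the starter file or measurements.

```python
def decode_step_lower_bound_ms(weight_bytes: float,
                               kv_cache_bytes: float,
                               bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if it is purely memory-bandwidth bound.

    Assumes (hypothetically) that each step reads every weight and the whole
    KV Cache exactly once, and that compute and scheduling cost nothing.
    """
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3


# Hypothetical numbers: 14 GB of FP16 weights, 4 GB of KV Cache, 1 TB/s memory.
weights = 14e9
kv = 4e9
bandwidth = 1e12
step_ms = decode_step_lower_bound_ms(weights, kv, bandwidth)
print(f"decode step lower bound: {step_ms:.1f} ms")
```

Under this simplified model, batching amortizes the weight read across requests while the KV Cache term grows with every request, which is one reason the ledger tracks the two separately.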
Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-01-transformer-memory/memory_accounting.py. This preview explains the key idea; the linked starter file is the source of truth.

attention_per_layer = 4 * d * d        # Q, K, V and output projection, each d x d
# FFN weights per layer in units of d^2 (e.g. 8 for two d x 4d matrices).
ffn_per_layer = args.ffn_multiplier * d * d
embedding = args.vocab_size * d

# Approximate count: biases and normalization weights are not included.
total_params = layers * (attention_per_layer + ffn_per_layer) + embedding

param_memory = total_params * args.precision_bytes
# Gradients are another tensor of roughly the same size as parameters.
grad_memory = total_params * args.precision_bytes
# Adam keeps two moving-average states (first and second moments) per parameter;
# this is why Adam uses much more memory than plain SGD.
adam_memory = total_params * args.adam_bytes

# Factor 2: one K tensor and one V tensor are cached per layer.
kv_cache_bytes = (
    args.batch_size * args.seq_len * layers * heads * head_dim * 2 * args.precision_bytes
)
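The excerpt depends on names defined elsewhere in the starter file (args, d, layers, heads, head_dim). For readers who want to check the arithmetic without opening that file, here is a minimal self-contained sketch of the same formulas. The Config class, the estimate function and every default value are hypothetical stand-ins, not the starter file's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # All defaults are hypothetical, chosen only to make the arithmetic easy to follow.
    d_model: int = 4096
    layers: int = 32
    heads: int = 32
    vocab_size: int = 32000
    ffn_multiplier: int = 8    # FFN weights per layer, in units of d_model^2
    precision_bytes: int = 2   # FP16/BF16
    adam_bytes: int = 8        # assumed: two FP32 moment tensors per parameter
    batch_size: int = 8
    seq_len: int = 4096


def estimate(cfg: Config) -> dict:
    """Apply the excerpt's formulas to a config and return sizes in bytes."""
    d = cfg.d_model
    head_dim = d // cfg.heads
    attention_per_layer = 4 * d * d
    ffn_per_layer = cfg.ffn_multiplier * d * d
    embedding = cfg.vocab_size * d
    total_params = cfg.layers * (attention_per_layer + ffn_per_layer) + embedding
    return {
        "total_params": total_params,
        "param_bytes": total_params * cfg.precision_bytes,
        "grad_bytes": total_params * cfg.precision_bytes,
        "adam_bytes": total_params * cfg.adam_bytes,
        "kv_cache_bytes": (
            cfg.batch_size * cfg.seq_len * cfg.layers
            * cfg.heads * head_dim * 2 * cfg.precision_bytes
        ),
    }


if __name__ == "__main__":
    for name, value in estimate(Config()).items():
        print(f"{name:>16}: {value:,}")
```

With these assumed defaults the parameter count lands near 6.6 billion and the KV Cache at 16 GiB, which already shows that the cache can rival the weights at moderate batch size and context length.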
Line-by-line Explanation

Key code blocks

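attention_per_layer
Counts the four self-attention projections (Q, K, V and output), each D x D. Multi-head attention only splits these matrices into h slices of width D/h, so the total stays 4D^2 regardless of head count. Grouped-query or multi-query attention shrinks the K and V projections, which this excerpt does not model.
ffn_per_layer
A compact approximation for FFN/SwiGLU variants: ffn_multiplier is the FFN weight count per layer in units of D^2. A standard two-matrix FFN with a 4D hidden size gives 8D^2. It is not exact, but it gives the right order of magnitude.
param_memory / grad_memory / adam_memory
Maps parameter count to training-state memory: the weights, gradients of the same shape, and Adam's per-parameter states. These are the terms ZeRO/FSDP shard across devices, which makes this the bridge to those techniques.
kv_cache_bytes
Maps inference serving pressure to batch size, sequence length, layers, KV heads, head_dim and dtype. The factor of 2 covers the separate K and V tensors.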
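To make the head-count claim and the FFN multiplier concrete, the following sketch counts attention weights head by head and expresses two common FFN layouts in units of D^2. The helper names are hypothetical, biases are ignored, and the 8/3 hidden ratio for SwiGLU is a common convention assumed here rather than something the starter file specifies.

```python
def attention_params(d_model: int, heads: int) -> int:
    """Count Q, K, V and output projection weights head by head (no biases)."""
    assert d_model % heads == 0, "d_model must divide evenly into heads"
    head_dim = d_model // heads
    # Each head owns a d_model x head_dim slice of the Q, K and V projections.
    per_head_qkv = 3 * d_model * head_dim
    # Concatenated head outputs are projected back to d_model.
    output_proj = (heads * head_dim) * d_model
    return heads * per_head_qkv + output_proj


def ffn_multiplier(hidden_ratio: float, num_matrices: int) -> float:
    """FFN weights per layer in units of d_model^2 (no biases)."""
    return num_matrices * hidden_ratio


d = 4096
for h in (1, 8, 32, 64):
    # The count is the same for every head count: 4 * d * d.
    print(h, attention_params(d, h) == 4 * d * d)

print("standard FFN (4x hidden, 2 matrices):", ffn_multiplier(4, 2))
print("SwiGLU (8/3 hidden, 3 matrices):", ffn_multiplier(8 / 3, 3))
```

Both FFN layouts come out at about 8 in units of D^2, which is why a single multiplier is a workable approximation across variants.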
What to Notice

How to read this code

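  • Do not double-count heads: 4 * d * d already covers every head, so multiplying by the head count again overstates the attention parameters.
  • Training memory and inference memory have different dominant terms: gradients, optimizer states and activations in training; weights and KV Cache in inference.
  • The script excludes activations and allocator fragmentation unless you add your own estimates.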
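The inference side of the second point can be sketched by looking at KV Cache cost per cached token and how it scales with context length. The function names and the 32-layer, head_dim 128 model below are hypothetical. The kv_heads parameter is an assumption added to show the effect of grouped-query attention, which the excerpt's heads variable does not distinguish.

```python
GIB = 2**30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor 2: one K vector and one V vector per layer for every cached token.
    return layers * kv_heads * head_dim * 2 * dtype_bytes


def kv_cache_bytes(batch_size: int, seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return batch_size * seq_len * per_token


# Hypothetical 32-layer model with head_dim 128 and an FP16 cache.
full_mha = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
gqa = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(f"per token, 32 KV heads: {full_mha} bytes")
print(f"per token,  8 KV heads: {gqa} bytes")

# The cache grows linearly with both batch size and sequence length.
for seq_len in (1024, 4096, 16384):
    total = kv_cache_bytes(8, seq_len, 32, 32, 128)
    print(f"batch=8 seq_len={seq_len}: {total / GIB:.1f} GiB")
```

Weights stay fixed while this term grows with every concurrent token, so at long context or large batch the cache, not the weights, sets the serving limit.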
Common Misunderstandings

What this code does not mean

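  • “7B in FP16 means training needs only 14 GB.” The 14 GB covers the weights alone. Training also needs gradients, optimizer states, activations and buffers.
  • “KV Cache is a training concern.” KV Cache is mainly an inference serving concern: it stores the keys and values of already-processed tokens so that autoregressive decoding does not recompute them.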
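The first misunderstanding is easy to check with arithmetic. The sketch below applies the excerpt's three training-state terms to 7 billion parameters. The function name is hypothetical, and the default of 8 bytes of Adam state per parameter (two FP32 moments) is an assumption. Mixed-precision setups that also keep an FP32 master copy of the weights would use 12.

```python
def training_state_bytes(params: int, weight_bytes: int = 2,
                         grad_bytes: int = 2, adam_bytes: int = 8) -> dict:
    """Weights + gradients + Adam states, excluding activations and buffers."""
    ledger = {
        "weights": params * weight_bytes,
        "gradients": params * grad_bytes,
        "adam_states": params * adam_bytes,
    }
    ledger["total"] = sum(ledger.values())
    return ledger


PARAMS_7B = 7_000_000_000

ledger = training_state_bytes(PARAMS_7B)
for name, size in ledger.items():
    print(f"{name:>12}: {size / 1e9:.0f} GB")

# Assumed variant: Adam moments plus an FP32 master copy of the weights.
with_master = training_state_bytes(PARAMS_7B, adam_bytes=12)
print(f"total with FP32 master weights: {with_master['total'] / 1e9:.0f} GB")
```

Even before activations, the training state is several times the 14 GB that the weights alone occupy.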
Interview Explanation

How to say it out loud

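A good explanation starts from 4D^2 attention and mD^2 FFN parameters per layer, then says the weights are only a lower bound on memory. Training adds gradients and Adam states, while inference drops those but adds KV Cache. This is why ZeRO/FSDP, checkpointing and PagedAttention solve different parts of the memory ledger: ZeRO/FSDP shard weights, gradients and optimizer states across devices, activation checkpointing trades recomputation for activation memory, and PagedAttention manages how KV Cache is allocated.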
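To show which ledger entries ZeRO-style sharding touches, here is a minimal sketch of per-GPU training-state memory at each ZeRO stage. The function name and byte sizes are assumptions (8 bytes of optimizer state per parameter, matching the Adam assumption used for the script). Activations, temporary buffers and communication overhead are deliberately left out.

```python
def per_gpu_state_bytes(params: int, n_gpus: int, stage: int,
                        weight_bytes: int = 2, grad_bytes: int = 2,
                        optim_bytes: int = 8) -> float:
    """Per-GPU weights + gradients + optimizer states under ZeRO stages 0-3.

    Stage 0 replicates everything; stage 1 shards optimizer states;
    stage 2 also shards gradients; stage 3 also shards the weights.
    """
    weights = params * weight_bytes
    grads = params * grad_bytes
    optim = params * optim_bytes
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return weights + grads + optim


PARAMS = 7_000_000_000  # hypothetical 7B model
for stage in range(4):
    gb = per_gpu_state_bytes(PARAMS, n_gpus=8, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Nothing in this calculation changes activation memory or KV Cache, which is the point of the ledger: checkpointing and PagedAttention address rows that sharding leaves untouched.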
