InfraLens - AI Infra Systems Map

AI Infra Umbrella

AI Infra Systems Map

The umbrella track for compute, memory, communication, profiling, serving, and tradeoffs. Distributed Training and Transformer Systems are deep-dive branches of this track.

What this track is

Parent systems map

Use this page to connect GPU kernels, memory accounting, distributed training, profiling, serving, quantization, and topology decisions into one AI systems fieldbook.

Read this page in passes. First build the main thread: why Transformers create parameter, activation, and KV Cache pressure; why GPUs are often limited by HBM traffic; why distributed training needs collectives; and why inference serving becomes a scheduling and cache-management problem. On the second pass, restate each concept as problem, mechanism, savings, cost, and measurement.

Three main threads

Thread	Core pressure	Question to answer
Compute	Kernels are often limited by memory movement, not only FLOPs.	Which bytes can be reused, fused, or avoided?
Memory	Training state and serving-time KV Cache need separate accounting.	Which state grows, and which optimization changes it?
Communication	Distributed training is shaped by rank ownership and critical-path collectives.	Which collective crosses which topology boundary?

What training focuses on

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

What inference focuses on

Inference focuses on prefill, decode, batching, KV Cache capacity, scheduler behavior, and tail latency under real request distributions.

Suggested learning order

Start with Transformer shapes, parameter count, activation memory, and KV Cache memory.
Then learn GPU memory hierarchy: HBM, L2, shared memory, registers, coalescing, bank conflicts, tiling, and fusion.
Move to DDP, ZeRO/FSDP, tensor/pipeline/context/expert parallelism, and NCCL collectives.
Use profiling to decide whether the bottleneck is CPU orchestration, kernel math, memory traffic, or communication.
Finally move to serving: prefill, decode, KV Cache, PagedAttention, continuous batching, quantization, and serving metrics.

Foundation

Transformer and Memory Accounting

Concept explanation: decoder-only block data flow

For input X: (B,S,D), the block projects Q=XWq, K=XWk, and V=XWv, splits them across H heads with Dh=D/H, and computes attention scores shaped (B,H,S,S). The output returns to (B,S,D) before the FFN and residual path.

Q/K/V shape: X(B,S,D) * Wq(D,D) -> Q(B,S,D) -> Q(B,H,S,D/H)

Attention score: Q(B,H,S,D/H) @ K^T(B,H,D/H,S) -> (B,H,S,S)

Connection: the S x S score matrix explains why long context stresses attention memory and why FlashAttention-style kernels matter.
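To make the shape trace concrete, the sketch below records each shape as a plain tuple and estimates the bytes needed if the full score matrix were materialized. The function names and the 2-byte element size are illustrative assumptions, and the estimate covers only the score tensor for one layer.

```python
def attention_shapes(B, S, D, H):
    """Return the tensor shapes of one self-attention pass as plain tuples."""
    assert D % H == 0, "hidden size must divide evenly across heads"
    Dh = D // H
    return {
        "X": (B, S, D),
        "Q_heads": (B, H, S, Dh),
        "K_T": (B, H, Dh, S),
        "scores": (B, H, S, S),
        "out": (B, S, D),
    }


def score_matrix_bytes(B, S, H, bytes_per_elem=2):
    """Bytes needed to materialize the full (B,H,S,S) score matrix."""
    return B * H * S * S * bytes_per_elem


shapes = attention_shapes(B=1, S=4096, D=4096, H=32)
print(shapes["scores"])

# Doubling S quadruples the score-matrix footprint.
for seq_len in (4096, 8192):
    gib = score_matrix_bytes(B=1, S=seq_len, H=32) / 2**30
    print(f"S={seq_len}: {gib:.1f} GiB per layer at 2 bytes per element")
```

The quadratic growth in S is the part to remember; batch size and head count only scale it linearly.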

Parameter count mental model

Self-attention projections are roughly 4D^2 per layer: Q, K, V, and output projection. The FFN is usually 2 * D * Dff; if Dff=4D, that is about 8D^2. Estimates often summarize the FFN as mD^2 (m=8 in this case), so one layer is roughly (4+m)D^2, with embeddings adding V*D once for the whole model.

For a rough estimate, plug values such as D=4096,L=32,V=32000,m=8 into the calculator below, then compare the result with the memory ledger rather than treating parameter count as total memory.

Training memory vs inference memory

Phase	Major resident state	What grows memory	Matching optimization
Training	Weights, gradients, optimizer states such as Adam `m` and `v`, activations, and temporary buffers.	Model state, batch size, sequence length, and saved activations.	ZeRO/FSDP, activation checkpointing, and memory-efficient attention.
Inference	Weights, runtime buffers, and KV Cache.	Concurrent sequences and retained context length.	Quantization, paging, cache-aware scheduling, and smaller KV representations.

Do not mix these into one vague "memory" number. A useful memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache.

Attention activations grow with sequence length, while KV Cache grows with retained context during serving. That difference is why training and inference memory fixes are not interchangeable.

Concept links

Transformer parameter count -> training memory ledger -> single-GPU memory failure -> ZeRO/FSDP partitions parameters, gradients, and optimizer states.
Attention's S x S intermediate -> HBM writes and activation pressure -> FlashAttention uses blockwise online softmax to avoid materializing the full score matrix.
Inference weights + KV Cache -> PagedAttention and KV quantization become useful serving tools.

How would you explain the memory ledger in an interview?

Name each memory owner first: weights, activations, gradients, optimizer states, temporary buffers, communication buckets, and KV Cache. Then explain which optimization shards, shrinks, moves, or recomputes that state.

Common misunderstanding

Do not say every memory fix solves the same problem. FlashAttention targets attention intermediates, ZeRO/FSDP shards training state, activation checkpointing recomputes activations, and KV Cache work mainly affects inference serving.

Calculators

Estimation tools

Use these quick estimates to keep parameter count, communication traffic, and training memory in separate boxes.

The numbers are intentionally rough. They are good for interview reasoning and sanity checks, not for replacing profiler traces or framework memory summaries.

Parameter-count estimator

Estimate each layer as attention 4D^2, FFN mD^2, plus embedding V*D.

Inputs: hidden_dim D, num_layers L, vocab_size V, FFN multiplier m.
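A minimal sketch of this estimate in Python follows. The function name is hypothetical, and the formula deliberately ignores biases, normalization weights, gated FFN variants, and any separate output head, so treat the result as a sanity check rather than an exact count.

```python
def estimate_params(D, L, V, m=8, attn_factor=4):
    """Rough parameter count: L layers of (attention + FFN) plus one embedding table."""
    per_layer = (attn_factor + m) * D * D   # 4D^2 attention + mD^2 FFN
    embedding = V * D
    return L * per_layer + embedding


total = estimate_params(D=4096, L=32, V=32000, m=8)
print(f"{total / 1e9:.2f}B parameters")   # about 6.57B for these inputs
```

With D=4096, L=32, V=32000, and m=8, the layers contribute about 6.44B parameters and the embedding about 0.13B, which shows why the D^2 terms dominate at this scale.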

Ring AllReduce time estimator

Approximate ring all-reduce traffic per rank as 2(N-1)/N * data, then compare fast and slow interconnect assumptions.

Inputs: rank count N, payload size (GB), fast link bandwidth (GB/s), slow link bandwidth (GB/s).
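The same estimate can be written as a short function. This is a bandwidth-only sketch: it ignores per-step latency and overlap with compute, and the 300 GB/s and 25 GB/s link speeds are placeholder assumptions, not measurements of any particular interconnect.

```python
def ring_allreduce_seconds(n_ranks, payload_gb, link_gb_per_s):
    """Bandwidth-only time for a ring all-reduce of payload_gb per rank."""
    if n_ranks < 2:
        return 0.0   # nothing to exchange with a single rank
    traffic_gb = 2 * (n_ranks - 1) / n_ranks * payload_gb
    return traffic_gb / link_gb_per_s


# Hypothetical link speeds for a fast intra-node path and a slow cross-node path.
for name, bandwidth in (("fast", 300.0), ("slow", 25.0)):
    seconds = ring_allreduce_seconds(n_ranks=8, payload_gb=14.0, link_gb_per_s=bandwidth)
    print(f"{name} link: {seconds:.3f} s per all-reduce")
```

For 8 ranks and a 14 GB payload, each rank moves 24.5 GB, so the gap between the two assumed links is roughly the ratio of their bandwidths.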

Training memory ledger estimator

Estimate persistent training state: weights, gradients, and optimizer states. Runtime peaks can still be higher.

Inputs: parameter count (billions), weight bytes per parameter, Adam bytes per parameter, DP degree.

GPU Kernel and Memory Hierarchy

From HBM to Triton / FlashAttention

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

Memory hierarchy mental model

Start from where bytes live and how often they move. A kernel can have plenty of math available and still be slow if it repeatedly reads from HBM or writes large intermediates.

Register: Per-thread storage, fastest but limited; excessive use can reduce occupancy.
Shared memory: Block-local scratch space used for tiling, reuse, reductions, and avoiding repeated HBM reads.
L2 / cache: On-GPU cache layer that helps reuse across memory operations but is still much slower than registers.
HBM / global memory: Large device memory with high bandwidth, but expensive enough that layout and access pattern dominate many kernels.

Key mechanisms

Mechanism	Main idea	Concrete takeaway
Coalesced memory access	Arrange neighboring threads to read neighboring addresses.	Inspect address layout before blaming arithmetic throughput.
Shared memory	Stage reused data close to the thread block.	Reuse a loaded tile before returning to HBM.
Bank conflict	Avoid many threads contending for one shared-memory bank.	Padding `tile[32][32]` to `tile[32][33]` is the classic transpose fix.
Reduce implementations	Move from naive global-memory accumulation to shared-memory staging and then warp-level primitives.	Compare synchronization cost as the reduction scope shrinks.
GEMM tiling	Reuse matrix tiles before returning to HBM.	Increase useful arithmetic per byte fetched.
Softmax fusion	Keep intermediate values close to the kernel.	Avoid materializing extra tensors when numerics and shapes allow it.
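The padding entry in the table can be checked with a small host-side model of bank mapping. This sketch assumes 32 banks and 4-byte elements, so the bank index is simply the word index modulo 32; it simulates addresses only and is not GPU code.

```python
NUM_BANKS = 32   # assumption: 32 shared-memory banks, one 4-byte word per bank


def banks_hit_by_column_read(row_width, col, n_threads=32):
    """Banks touched when thread t reads tile[t][col] from a row-major tile."""
    return [(t * row_width + col) % NUM_BANKS for t in range(n_threads)]


def conflict_degree(banks):
    """Largest number of threads that land on the same bank."""
    return max(banks.count(bank) for bank in set(banks))


unpadded = conflict_degree(banks_hit_by_column_read(row_width=32, col=0))
padded = conflict_degree(banks_hit_by_column_read(row_width=33, col=0))
print(f"tile[32][32] column read: {unpadded}-way conflict")
print(f"tile[32][33] column read: {padded}-way conflict")
```

With a row width of 32, every thread in the column read maps to the same bank; the extra padding column shifts each row by one bank so the 32 threads spread across all banks.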

FlashAttention mental model

Naive attention materializes the (B,H,S,S) score matrix. FlashAttention-style kernels compute attention in blocks with online softmax so the full score matrix does not need to be written to HBM.
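The online-softmax idea can be illustrated for a single query row in plain Python. This is a sketch of the numerical trick only, with scalar values and a hypothetical block size; real kernels tile Q, K, and V in on-chip memory and handle full head dimensions.

```python
import math


def naive_attention_row(scores, values):
    """Softmax-weighted sum using the full score row at once."""
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_attention_row(scores, values, block=2):
    """Same result, but scores are consumed block by block."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new running maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values), online_attention_row(scores, values))
```

Only the running maximum, the running denominator, and the running weighted sum persist between blocks, which is why the full row of scores never has to be stored.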

Optimization point	Problem solved	Mechanism	What to measure	Common misunderstanding
Coalescing	Uncoalesced global reads waste memory transactions.	Make adjacent threads access adjacent addresses.	global load/store efficiency, memory throughput	Coalescing is about address pattern, not only total bytes.
Shared memory tiling	Repeated HBM reads dominate arithmetic.	Load tiles once, reuse them inside the block.	HBM throughput, L2 hit rate, occupancy	Shared memory can also lower occupancy if tile size is too large.
Padding transpose	Shared memory bank conflicts serialize access.	Pad the layout, for example `32x33`, to shift bank mapping.	shared bank conflict metric	Padding fixes one access pattern, not every layout problem.
Kernel fusion	Intermediate tensors create extra reads, writes, and launch overhead.	Combine adjacent operations into one kernel when shapes and numerics allow it.	kernel count, HBM writes, launch overhead	Fusion can increase register pressure or reduce reuse if applied blindly.
FlashAttention	`S x S` attention memory and HBM traffic	blockwise Q/K/V + online softmax	memory peak, attention kernel time, tokens/s	It is an attention algorithm/kernel strategy, not a new model architecture.

How do you tell whether a kernel is bandwidth-bound?

Compare achieved memory bandwidth, Tensor Core utilization, occupancy, and instruction mix. If memory throughput is high while compute utilization is low, the next question is layout, tiling, fusion, or reducing bytes moved.

Distributed Training

Replicas, shards, collectives, and optimizer semantics

Data parallel baseline

DDP keeps a full model replica on each rank, feeds each rank a different data shard, and all-reduces gradient buckets during backward so all replicas apply the same update.

DDP code path to recognize

The user code looks local, but DDP installs autograd hooks. When gradients are produced, those hooks bucket the gradients and launch collectives through the process group.

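import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# build_model, train_dataset, batch_size, num_epochs, and move_to_cuda are placeholders defined elsewhere.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = DDP(build_model().cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())   # optimizer choice is illustrative
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle with a different order each epoch
    for batch in loader:
        loss = model(**move_to_cuda(batch)).loss
        loss.backward()       # autograd hooks trigger gradient bucket all-reduce
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()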

Sharding training state

ZeRO and FSDP reduce per-GPU memory by sharding optimizer states, gradients, and parameters at different levels. That saves memory but adds all-gather, reduce-scatter, checkpoint, and runtime peak-memory considerations.

Method	What is partitioned	Key communication	What it saves	What it costs	What to measure
DDP	Input data only; parameters are replicated.	gradient all-reduce	No memory; adds throughput via more workers	Full model and optimizer memory per GPU	step time, all-reduce overlap, samples/s
ZeRO-1	optimizer states	gradient reduction, then all-gather of updated parameters	Adam states memory	More state movement and optimizer complexity	per-GPU optimizer memory, step overhead
ZeRO-2	optimizer states + gradients	reduce-scatter / all-gather	gradients and Adam states memory	More collective traffic on the backward path	gradient memory, NCCL time
ZeRO-3 / FSDP	parameters + gradients + optimizer states	parameter all-gather, gradient reduce-scatter	Most persistent training state memory	Parameter materialization peaks and shard checkpoint complexity	peak memory, gather time, shard checkpoint cost
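The table can be turned into a rough per-GPU estimate of persistent training state. The sketch assumes mixed-precision Adam with 2 bytes for weights, 2 bytes for gradients, and 12 bytes of optimizer state per parameter; those byte counts and the function name are assumptions, and activations, buffers, and all-gather peaks are not included.

```python
def per_gpu_state_gb(params_b, dp_degree, stage,
                     weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Persistent state per GPU in GB for params_b billion parameters.

    stage 0 = DDP (nothing sharded), 1 = optimizer states sharded,
    2 = plus gradients, 3 = plus parameters.
    """
    weight_shards = dp_degree if stage >= 3 else 1
    grad_shards = dp_degree if stage >= 2 else 1
    optim_shards = dp_degree if stage >= 1 else 1
    bytes_per_param = (weight_bytes / weight_shards
                       + grad_bytes / grad_shards
                       + optim_bytes / optim_shards)
    return params_b * bytes_per_param   # billions of params * bytes = GB


for stage in range(4):
    gb = per_gpu_state_gb(params_b=7.0, dp_degree=8, stage=stage)
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```

For a 7B-parameter model on 8 data-parallel ranks, these assumptions give 112 GB unsharded and 14 GB fully sharded, with most of the first drop coming from the optimizer states.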

Parallelism families

Parallelism	What it splits or routes	Primary systems cost
Tensor Parallel (TP)	Layer-internal matrix work.	High-frequency collectives; keep within a fast interconnect domain.
Pipeline Parallel (PP)	Model layers into stages.	Pipeline bubbles and stage balancing.
Data Parallel (DP)	Input batches while replicating or sharding model state.	Gradient or state synchronization.
Sequence / Context Parallelism	Sequence or context work.	Attention communication while reducing activation memory.
Expert Parallelism	Tokens routed across expert owners.	All-to-all traffic and load balancing.
3D Parallelism	Multiple parallel axes together.	Topology-aware mapping of every collective.

64 GPU topology design example

Keep high-frequency collectives such as tensor-parallel all-reduce inside NVLink when possible, map slower data-parallel communication across nodes, and explain how the placement changes failure domains and checkpoint traffic.
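One way to sanity-check such a placement is to enumerate the communication groups and test which ones leave a node. The sketch assumes 8 GPUs per node, degrees TP=8, DP=4, PP=2, and a rank layout where the TP index varies fastest, then DP, then PP; real frameworks choose their own ordering, so check the launcher's mapping before relying on this.

```python
GPUS_PER_NODE = 8   # assumption for this example


def coords(rank, tp, dp, pp):
    """Assumed layout: TP index varies fastest, then DP, then PP."""
    return rank % tp, (rank // tp) % dp, rank // (tp * dp)


def groups(world, tp, dp, pp, axis):
    """Group ranks that differ only along one axis (0=TP, 1=DP, 2=PP)."""
    assert world == tp * dp * pp
    buckets = {}
    for rank in range(world):
        key = list(coords(rank, tp, dp, pp))
        key[axis] = None   # ignore the axis we are grouping over
        buckets.setdefault(tuple(key), []).append(rank)
    return list(buckets.values())


def crosses_nodes(group):
    """True when the ranks in a group live on more than one node."""
    return len({rank // GPUS_PER_NODE for rank in group}) > 1


tp_groups = groups(64, tp=8, dp=4, pp=2, axis=0)
dp_groups = groups(64, tp=8, dp=4, pp=2, axis=1)
print("TP group 0:", tp_groups[0], "crosses nodes:", crosses_nodes(tp_groups[0]))
print("DP group 0:", dp_groups[0], "crosses nodes:", crosses_nodes(dp_groups[0]))
```

Under this layout every tensor-parallel group stays inside one node while every data-parallel group spans nodes, which matches the placement rule described above.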

Common misunderstanding

Adding GPUs is not automatically a data-parallel speedup. The real design choice is which state is replicated, which state is sharded, and which collective or all-to-all exchange lands on the critical path.

Communication and Topology

Collectives, topology, and bottleneck diagnosis

Collective input/output intuition

AllReduce: Every rank contributes a tensor and every rank receives the reduced result. DDP gradient sync is the common example.
AllGather: Each rank starts with one shard and receives all shards, often used before a sharded parameter is needed.
ReduceScatter: Ranks reduce inputs and scatter the reduced shards, commonly paired with all-gather in sharded training.
AllToAll: Each rank sends a different slice to every other rank. Expert parallelism and token routing often pay this cost.

Ring AllReduce per-GPU traffic: 2 * (N - 1) / N * data_size

Bandwidth vs latency: large buckets are usually bandwidth-oriented, while many small collectives can be dominated by launch, synchronization, and topology latency.

How to read `nvidia-smi topo -m`

Topology labels tell you whether traffic stays on fast GPU links or crosses PCIe, host bridges, NUMA boundaries, or the system interconnect. Use them to place ranks, not just to describe hardware.

Label	Intuition	Placement implication
`NV#`	GPUs are connected by NVLink links.	Keep tensor-parallel or high-frequency collectives here when possible.
`PIX`	Traffic goes through a single PCIe switch.	Acceptable for moderate traffic, but weaker than NVLink.
`PHB`	Traffic crosses a PCIe host bridge.	Avoid placing the hottest collectives across this path.
`NODE`	Traffic crosses PCIe and the interconnect between PCIe host bridges within one NUMA node.	Bind CPU, NIC, and GPU placement deliberately.
`SYS`	Traffic crosses PCIe and the inter-socket (SMP) interconnect between NUMA nodes.	Treat it as a slow path for frequent GPU-GPU communication.

Diagnosing the bottleneck

First decide whether the wait is compute, memory, communication, or orchestration. Then tie the diagnosis to a trace, counter, or controlled experiment.

Bottleneck	Signal	Typical cause	Next experiment
Compute-bound	Tensor Core utilization is high.	dtype, tile shape, kernel library, or batch shape	Change dtype, kernel backend, or problem shape.
Memory-bound	HBM throughput high, compute utilization low.	Excessive reads/writes or poor locality	Try fusion, tiling, FlashAttention, quantization, or layout changes.
Communication-bound	NCCL ranges dominate the step timeline.	Large buckets, slow topology path, or poor overlap	Change rank placement, bucket size, overlap settings, or parallelism axes.

Profiling Methodology

Trace first, then explain the bottleneck

A good performance explanation names the slow path, shows the evidence, and proposes one controlled experiment.

Tool split

Tool	Question it answers	Evidence scope
Nsight Systems	Where does elapsed time go?	CPU work, CUDA kernels, memory copies, and NCCL on a timeline.
Nsight Compute	Why does one kernel behave this way?	Occupancy, instruction mix, memory throughput, and bank conflicts.
PyTorch Profiler	Which framework operation created the work?	Operator-to-kernel attribution and high-level traces.

Roofline mental model

Roofline reasoning asks whether arithmetic intensity is high enough to use available compute. If it is not, optimizing memory movement may matter more than adding more math throughput.
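A small calculation makes the roofline comparison concrete. The peak compute and HBM bandwidth below are hypothetical round numbers, not the specification of any GPU, and the byte count assumes each matrix is read or written exactly once.

```python
def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    """FLOPs per byte for C[M,N] = A[M,K] @ B[K,N], each matrix touched once."""
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved


def attainable_tflops(intensity, peak_tflops, hbm_tb_per_s):
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_tflops, intensity * hbm_tb_per_s)


PEAK_TFLOPS = 300.0   # hypothetical accelerator peak
HBM_TB_PER_S = 2.0    # hypothetical HBM bandwidth
ridge = PEAK_TFLOPS / HBM_TB_PER_S   # intensity needed to reach the compute roof

decode_like = gemm_arithmetic_intensity(M=1, N=4096, K=4096)
large_gemm = gemm_arithmetic_intensity(M=4096, N=4096, K=4096)
for name, intensity in (("M=1 (decode-like)", decode_like), ("M=4096", large_gemm)):
    bound = attainable_tflops(intensity, PEAK_TFLOPS, HBM_TB_PER_S)
    print(f"{name}: {intensity:.1f} FLOPs/byte, bound {bound:.1f} TFLOP/s, ridge {ridge:.0f}")
```

The single-row case sits far below the ridge point and is capped by bandwidth, while the square GEMM sits above it and is capped by compute.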

Keep workload fixed: Do not compare traces with different batch sizes, prompt lengths, output lengths, or data loaders.

First look at Systems: Check CPU gaps, CUDA launch overhead, NCCL waits, memory copies, and queue time.

Then look at Compute: Use Nsight Compute for occupancy, memory throughput, instruction mix, and bank conflicts.

Change one variable: Toggle one backend, dtype, bucket size, or batch shape so the evidence is interpretable.

Report residual risk: Name what the trace does not prove, such as production traffic mix or version-sensitive backend behavior.

Symptom	Likely cause	Tool evidence	Next move
GPU gaps between kernels	dataloader, tokenizer, CPU sync, H2D, or launch overhead	Nsight Systems CPU/GPU timeline	Fix input pipeline or remove synchronization.
NCCL dominates step time	communication-bound parallel layout	nsys trace + topology mapping	Revisit rank placement, overlap, and bucket size.
Memory throughput near peak	HBM-bound kernel or extra tensor materialization	Nsight Compute memory metrics	Try tiling, fusion, FlashAttention, quantization, or layout changes.
Shared memory conflict high	bank conflict or poor tile layout	NCU shared memory bank conflict	Pad or reshape shared-memory tiles.
Serving P99 high	queueing, long prompts, decode pressure, or KV Cache fragmentation	request traces + scheduler metrics	Separate prefill and decode metrics before tuning batching.

torchrun --nproc-per-node=8 train.py --config config.yaml
nsys profile --trace=cuda,nvtx,osrt --output=step_report python train.py
ncu --set full --target-processes all python kernel_bench.py
# Serving comparison: keep prompt/output length distributions fixed before comparing engines.

Inference Serving

Prefill, decode, KV Cache, batching, and tail latency

Serving work is not one uniform forward pass. Separate prompt ingestion from token-by-token generation before reasoning about throughput, latency, or cache capacity.

Serving loop mental model

A request first enters a scheduler or queue, then prefill computes the prompt and writes KV Cache. Decode repeatedly reads that cache, appends one token per step, and streams output while the scheduler admits more work.

Prefill vs Decode

Phase	State transition	Dominant metric	Typical pressure
Prefill	Processes prompt tokens and builds the initial KV Cache.	TTFT	Compute and attention work for long prompts.
Decode	Reads and appends KV Cache one generated token at a time.	TPOT	Cache capacity, memory bandwidth, and scheduling under load.

KV Cache memory scales with batch * layers * kv_heads * head_dim * sequence_length * 2(K,V) * bytes. That is why request length distribution and cache management are first-order serving concerns.

Formula: serving KV Cache memory = 2 * layers * batch * seq_len * kv_heads * head_dim * bytes

2 counts key and value tensors.
batch and seq_len capture concurrent requests and cached context length.
kv_heads can be smaller than attention heads for MQA/GQA models.
bytes depends on FP16/BF16 or cache quantization support.
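The formula translates directly into a short estimator. The model dimensions below are illustrative assumptions, chosen to show how a smaller kv_heads value changes the result.

```python
def kv_cache_bytes(layers, batch, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """KV Cache size: the factor 2 counts the key and value tensors."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_elem


# Hypothetical 32-layer model with head_dim=128, 16 concurrent sequences of 4096 tokens.
full_heads = kv_cache_bytes(layers=32, batch=16, seq_len=4096, kv_heads=32, head_dim=128)
grouped = kv_cache_bytes(layers=32, batch=16, seq_len=4096, kv_heads=8, head_dim=128)
print(f"kv_heads=32: {full_heads / 2**30:.0f} GiB")
print(f"kv_heads=8:  {grouped / 2**30:.0f} GiB")
```

With these numbers the cache shrinks from 32 GiB to 8 GiB when kv_heads drops from 32 to 8, which is the effect the MQA/GQA note above refers to.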

request arrives
  -> scheduler / queue
  -> prefill builds KV Cache and determines TTFT
  -> decode loop reads KV Cache one step at a time
  -> stream output tokens and track TPOT, P95/P99

PagedAttention, Continuous batching and Prefix cache

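PagedAttention-style cache management reduces waste from variable-length requests. Continuous batching improves GPU utilization by mixing requests at different decode positions, but it can increase queueing complexity and tail-latency risk. Prefix caching reuses KV Cache entries across requests that share the same prompt prefix, so the shared part does not need to be prefilled again.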
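The waste reduction can be illustrated with simple slot accounting. This sketch compares reserving a maximum-length contiguous region per request against allocating fixed-size blocks on demand; the block size of 16 tokens and the request lengths are assumptions, and it models bookkeeping only, not any serving engine's allocator.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request gets a max_len region up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=16):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [37, 512, 1200, 90]   # hypothetical current lengths of four requests
used = sum(seq_lens)
reserved_contiguous = contiguous_slots(seq_lens, max_len=2048)
reserved_paged = paged_slots(seq_lens, block_size=16)
print(f"used {used}, contiguous reserves {reserved_contiguous}, paged reserves {reserved_paged}")
```

In the paged case the only unused slots are those in each request's last partially filled block, so waste is bounded by one block per request rather than by the gap to the maximum length.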

Quantization and speculative decoding

Quantization can reduce memory and bandwidth pressure when kernels support the target dtype and accuracy remains acceptable. Speculative decoding helps only when acceptance rate, draft-model cost, and scheduler overhead make the extra machinery worthwhile.
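A back-of-the-envelope model shows how acceptance rate and draft cost interact. It assumes each drafted token is accepted independently with the same probability, measures draft cost as a fraction of one target-model step, and ignores scheduler overhead; all names and numbers are illustrative.

```python
def expected_tokens_per_verify(alpha, k):
    """Expected tokens produced per target-model pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Tokens per unit cost relative to plain decoding (1 token per target step)."""
    cost = k * draft_cost_ratio + 1   # k draft steps plus one verification pass
    return expected_tokens_per_verify(alpha, k) / cost


for alpha, ratio in ((0.8, 0.1), (0.3, 0.1), (0.3, 0.25)):
    speedup = estimated_speedup(alpha, k=4, draft_cost_ratio=ratio)
    print(f"alpha={alpha}, draft cost={ratio}: estimated speedup {speedup:.2f}x")
```

Under these assumptions a high acceptance rate with a cheap draft model gives a clear gain, while a low acceptance rate with a more expensive draft model drops below 1x, meaning the extra machinery costs more than it saves.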

Metric	Meaning	What it tells you
TTFT	Time To First Token	Prefill, queueing, and admission behavior.
TPOT	Time Per Output Token	Decode loop cost and cache-read pressure.
QPS	Requests served per second	Admission capacity under a specific traffic mix.
tokens/s	Generated tokens per second	Throughput, but only meaningful with prompt/output length distributions.
P95 / P99	Tail latency percentiles	Queueing, long-context pressure, or scheduling unfairness.
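These metrics can be computed from per-request timestamps. The sketch below assumes each request records an arrival time and the emission time of every output token, and it uses a nearest-rank percentile; the helper names and sample values are hypothetical.

```python
import math


def ttft(arrival, token_times):
    """Time To First Token: includes queueing and prefill."""
    return token_times[0] - arrival


def tpot(token_times):
    """Time Per Output Token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request: arrives at t=10.0 s, emits three tokens.
print(ttft(10.0, [10.5, 11.0, 11.5]), tpot([10.5, 11.0, 11.5]))

# Hypothetical end-to-end latencies in seconds; one slow request sets the tail.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 9.0]
print(percentile(latencies, 50), percentile(latencies, 95))
```

The median of the sample latencies looks healthy while the P95 is set by the single slow request, which is why tail percentiles are reported separately from averages.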

System	Useful mental model	Check before claiming
vLLM	PagedAttention, continuous batching, and serving scheduler behavior.	Version, backend, supported models, and workload shape.
SGLang	Structured generation, runtime scheduling, and serving orchestration.	Frontend language features, backend engine, and cache behavior.
TensorRT-LLM	Optimized inference runtime with engine build and kernel choices.	Hardware, dtype, quantization path, and engine build constraints.

How should you explain high serving latency?

Separate queue time, prefill time, decode TPOT, and KV Cache pressure. Then say whether the evidence points to scheduling, memory bandwidth, cache fragmentation, or request-shape skew.

Tradeoff Matrix

What each optimization buys and costs

Use this matrix to avoid generic claims. Each row should connect a concrete bottleneck to a mechanism, cost, and measurement.

The most common mistake is naming a technique without naming the state object or resource it changes.

Technique	Problem	Mechanism	Saves what	Costs what	When to use	When not to use	What to measure
ZeRO / FSDP	Training state does not fit per GPU.	Shard parameters, gradients, and/or optimizer states.	Persistent training memory	Extra collectives, materialization peaks, checkpoint complexity	Large training jobs where memory is the limit	Small models where communication dominates	peak memory, NCCL time, step time
Activation checkpointing	Activations dominate training memory.	Drop selected activations and recompute them in backward.	activation memory	extra compute and longer step time	Memory-bound training with spare compute	Compute-bound training already near the time budget	peak memory, step time, recompute overhead
FlashAttention	Attention `S x S` intermediates and HBM traffic are large.	blockwise Q/K/V + online softmax	HBM traffic, attention memory	backend constraints and kernel compatibility	Long context or attention-heavy workloads	Unsupported masks, dtypes, layouts, or hardware	attention time, memory peak, tokens/s
Tensor Parallel	One layer's matrix work is too large for one GPU.	Split layer-internal computation across ranks.	per-GPU compute and parameter load	high-frequency collectives	Fast intra-node interconnect such as NVLink	Slow cross-node paths for small layers	NCCL time per layer, MFU, step time
Pipeline Parallel	Model depth does not fit or scale well on one device group.	Split layers into pipeline stages.	per-stage model memory	bubble, stage imbalance, activation buffering	Very deep models with balanced stages	Small batch or uneven stage timing	bubble ratio, stage time, utilization
SP / CP	Sequence or context memory is too large.	Split sequence/context dimension across ranks.	activation/KV/context memory	all-to-all/gather, mask and kernel complexity	Long-context training or inference	Short sequences where overhead dominates	memory peak, collective time, correctness tests
Expert Parallel	MoE experts increase capacity but cannot all run on every rank.	Route tokens to experts placed on different ranks.	expert parameter and compute placement	all-to-all, load imbalance, capacity drops	MoE models with enough token volume	Highly skewed routing without capacity control	all-to-all time, expert load, dropped tokens
Quantization	Weights or KV Cache stress memory and bandwidth.	Store or compute with fewer bits when kernels support it.	memory, HBM bandwidth	accuracy risk, calibration, kernel support	Serving workloads limited by memory traffic	Quality-sensitive paths without validation	quality eval, tokens/s, TPOT, memory
Continuous batching	GPU idles while requests arrive and finish at different times.	Admit new requests into ongoing decode batches.	GPU idle time, goodput	scheduling complexity and queueing variance	Mixed online serving traffic	Strict isolation or very predictable single-request traffic	QPS, TPOT, P95/P99, queue time
Speculative decoding	Decode is slow one token at a time.	Draft tokens with a cheaper model and verify with the target model.	wall-clock decode time	draft-model cost, acceptance-rate sensitivity	High acceptance rate with cheap draft model	Low acceptance rate or tight quality constraints	acceptance rate, TPOT, quality
Triton custom kernel	Framework kernels leave a clear hot path.	Write a workload-specific fused or tiled kernel.	kernel launches, HBM traffic, overhead	maintenance and correctness burden	Stable hot workload with measurable headroom	Rapidly changing shapes or fragile numerics	kernel time, correctness, maintenance cost

Whiteboard Drills

Practice saying the system tradeoff out loud

Use these prompts to turn mechanisms into interview-ready explanations.

How do you trace a Transformer block's tensor shapes?

Short answer: start from X(B,S,D), project Q/K/V, split heads to (B,H,S,D/H), form scores (B,H,S,S), then return to (B,S,D).

Deeper explanation: the shape trace tells you where attention memory grows, where FFN parameters live, and which axis a parallel strategy would split.

Pitfall: do not treat the S x S score matrix as a permanent model parameter; it is an activation/intermediate.

Follow-up: connect the shape trace to FlashAttention and KV Cache memory.

Why does FFN parameter count often look like mD^2?

Short answer: a two-layer FFN has roughly 2 * D * Dff parameters; when Dff is a multiple of D, the count is summarized as mD^2.

Deeper explanation: this estimate helps separate parameter memory from activation memory and KV Cache memory.

Pitfall: do not claim parameter count alone predicts training memory; optimizer states and activations can dominate.

Follow-up: explain how embedding size V*D changes the total.

How do you build a memory ledger for training?

Short answer: list weights, gradients, optimizer states, activations, temporary buffers, and communication buckets separately.

Deeper explanation: then mark which items are persistent, which appear only at runtime peaks, and which are replicated or sharded across ranks.

Pitfall: do not hide activation checkpointing, ZeRO/FSDP, and quantization under one generic "memory optimization" label.

Follow-up: estimate which owner dominates before choosing an optimization.

How do DDP and FSDP differ at the state level?

Short answer: DDP replicates parameters and synchronizes gradients; FSDP shards parameters, gradients, and optimizer state around module execution.

Deeper explanation: FSDP saves memory by materializing parameter shards when needed and reducing/scattering gradients during backward.

Pitfall: FSDP can still have runtime peak memory during all-gather or prefetch windows.

Follow-up: name the all-gather and reduce-scatter points in the step timeline.

How do you choose a parallelism layout for 64 GPUs?

Short answer: map the most frequent communication to the fastest links, usually keeping tensor parallelism inside a fast node or NVLink island.

Deeper explanation: choose TP, PP, DP/FSDP, context parallelism, or expert parallelism by naming the state and axis being split.

Pitfall: do not pick degrees before checking topology and bucket/activation traffic.

Follow-up: explain which collectives cross nodes and how that affects failure recovery.

How do you validate a systems claim?

Short answer: state the claim, hold the workload fixed, measure the relevant bottleneck, and change one variable.

Deeper explanation: a good experiment says whether the result is limited by memory bandwidth, compute, communication, scheduler behavior, or request mix.

Pitfall: do not compare tokens/s across different prompt/output distributions and call it a serving win.

Follow-up: show the counter or trace that would falsify your explanation.

How do prefill and decode stress serving differently?

Short answer: prefill handles the prompt and drives TTFT; decode repeatedly reads KV Cache and drives TPOT.

Deeper explanation: batching helps utilization, but longer prompts and retained context increase KV Cache pressure and tail latency.

Pitfall: do not explain serving latency with one average latency number.

Follow-up: separate queue time, prefill time, decode TPOT, and P95/P99.

How do you explain a memory-bound kernel?

Short answer: the kernel's runtime is set by how fast bytes move rather than by the math units, which sit partly idle waiting for data, so improving layout, reuse, or fusion matters more than adding FLOPs.

Deeper explanation: point to HBM throughput, low compute utilization, memory transactions, or shared-memory conflicts.

Pitfall: do not call everything memory-bound just because memory is large; use profiler counters.

Follow-up: suggest one experiment: tiling, fusion, quantization, or FlashAttention depending on the workload.

Interview Practice

Use these representative prompts to rehearse mechanisms and tradeoffs. The full Q&A lives in the interview section so this handbook stays concept-first.

What does AI infrastructure optimize beyond model accuracy?
How would you debug a slow ML training job?
What signals belong in ML observability?
What makes LLM infrastructure different from ordinary model serving?

Runtime Extensions

New system branches

Continue from core training and serving mechanisms into specialized runtime domains.

RL Infrastructure

Rollout production, policy freshness, placement, distributed checkpointing, and recovery.

Multimodal Serving

Stage orchestration, streamed outputs, heterogeneous batching, and intermediate ownership.

Systems Runtime

RAII, bounded queues, backpressure, transports, NCCL errors, and timeout evidence.

Coding Practice

Executable exercises and calculators for memory, communication, latency, and rollout capacity.

Practical Labs

Annotated Code Reading Labs

Handbook explains the concepts.
Labs show the concepts through annotated code.
Starter files are optional source-of-truth examples.
Running the code: Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Lab	Concept	Code reading focus	Key mechanism	Interview takeaway
01	Transformer Memory Accounting	Formula script for params, Adam states and KV Cache	Config values become a memory ledger	Training memory and inference memory have different dominant terms
02	Single GPU Training Loop	Minimal PyTorch forward/backward/optimizer flow	Activations, gradients, optimizer state and checkpoint lifecycle	A training step is a resource lifecycle, not just a loss update
03	DDP Conversion	torchrun, rank, process group and DDP wrapper	Gradient sync is hidden behind backward hooks and buckets	DDP replicates models and synchronizes gradients; it does not shard memory
04	CUDA Reduce Optimization	Atomic and shared-memory reduction kernels	Block/thread cooperation turns global contention into staged aggregation	Reduction performance is about memory traffic, contention and synchronization
05	Shared Memory Bank Conflict	Transpose tiles with and without padding	Tile shape changes shared-memory bank mapping	`tile[32][33]` is a mechanism example, not a magic constant
06	Triton Fused Softmax	tl.load, masks, reductions and tl.store	One program handles a block/row and avoids intermediate HBM writes	Fusion helps when it removes real memory traffic or launch overhead
07	FlashAttention Mental Model	Naive attention and online softmax walkthrough	Blockwise attention avoids materializing `S x S`	FlashAttention is exact attention reordered for IO awareness
08	ZeRO / FSDP Memory Sharding	Memory formulas for replicated vs sharded states	State sharding trades memory for all-gather/reduce-scatter	Higher sharding stages reduce persistent state but add communication peaks
09	Nsight Profiling Workflow	Command anatomy and profiling checklist	Systems timeline first, kernel counters second, hypothesis always	Profiling is evidence collection for one causal question at a time
10	vLLM Serving Workload Config	Serving workload YAML and metric vocabulary	Prefill, decode, concurrency and KV Cache shape metrics	TTFT, TPOT, QPS and tokens/s only make sense with length distribution
11	Quantization Comparison	Comparison plan for precision choices	Lower precision saves storage/bandwidth only when kernels and quality allow it	Always state what is quantized: weights, activations or KV Cache
12	64-GPU Parallelism Design	Topology worksheet for TP/PP/DP/FSDP placement	Parallel axes must be mapped to fast and slow communication paths	A good answer explains communication hot paths, not only degrees

References

Official references and supporting notes

Use official docs and papers for API behavior and version-sensitive claims; use blogs only to improve intuition.

Internal pages

Distributed Training Handbook: DDP, FSDP, ZeRO, collectives, and parallelism layout.
Transformer Handbook: QKV, attention shape, RoPE, FFN, KV Cache, and inference mechanics.
Diffusers Handbook: Diffusion systems frame generation as iterative denoising. Read the model output contract, scheduler update, latent representation, conditioning path, and memory tradeoff before comparing model names.
DiffSynth Handbook: Diffusion systems frame generation as iterative denoising. Read the model output contract, scheduler update, latent representation, conditioning path, and memory tradeoff before comparing model names.

External references

AI Infra Guide: High-level learning route and supporting intuition.
NCCL Collectives: Official vocabulary for all-reduce, all-gather, reduce-scatter, broadcast, and all-to-all.
CUDA Best Practices: Memory coalescing, shared memory, memory banks, bank conflicts and optimization guidance.
Triton Fused Softmax: Concrete example of blocking, online softmax, and reducing HBM traffic.
Nsight Systems: System-level timeline for CPU, CUDA, memory copies, and NCCL activity.
Nsight Compute: Kernel-level counters for memory, occupancy, instructions, and shared-memory behavior.
DeepSpeed: ZeRO, distributed training, and large-model optimization tooling.
Megatron-LM: Tensor, pipeline, sequence/context, and expert-parallel training patterns.
vLLM: PagedAttention, continuous batching, KV Cache management, and serving behavior.
SGLang: Structured generation runtime and serving orchestration.
TensorRT-LLM: Optimized LLM inference runtime, engine build path, and deployment constraints.
HF Attention Backends: Transformers attention implementation selection; exact option names/defaults are version-sensitive.