Parent systems map
Use this page to connect GPU kernels, memory accounting, distributed training, profiling, serving, quantization, and topology decisions into one AI systems fieldbook.
The umbrella track for compute, memory, communication, profiling, serving, and tradeoffs. Distributed Training and Transformer Systems are deep-dive branches of this track.
Read this page in passes. First build the main thread: why Transformers create parameter, activation, and KV Cache pressure; why GPUs are often limited by HBM traffic; why distributed training needs collectives; and why inference serving becomes a scheduling and cache-management problem. On the second pass, restate each concept as problem, mechanism, savings, cost, and measurement.
| Thread | Core pressure | Question to answer |
|---|---|---|
| Compute | Kernels are often limited by memory movement, not only FLOPs. | Which bytes can be reused, fused, or avoided? |
| Memory | Training state and serving-time KV Cache need separate accounting. | Which state grows, and which optimization changes it? |
| Communication | Distributed training is shaped by rank ownership and critical-path collectives. | Which collective crosses which topology boundary? |
Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
Inference focuses on prefill, decode, batching, KV Cache capacity, scheduler behavior, and tail latency under real request distributions.
For input X: (B,S,D), the block projects Q=XWq, K=XWk, and V=XWv, splits them across H heads with Dh=D/H, and computes attention scores shaped (B,H,S,S). The output returns to (B,S,D) before the FFN and residual path.
Q/K/V shape: X(B,S,D) * Wq(D,D) -> Q(B,S,D) -> Q(B,H,S,D/H)
Attention score: Q(B,H,S,D/H) @ K^T(B,H,D/H,S) -> (B,H,S,S)
Connection: the S x S score matrix explains why long context stresses attention memory and why FlashAttention-style kernels matter.
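As a sanity check on the trace above, this sketch walks the shapes in plain Python and reports the size of the score tensor. The helper name `attention_shapes`, the 2-byte element size, and the example sizes are illustrative assumptions, not part of any library.

```python
def attention_shapes(B, S, D, H, bytes_per_elem=2):
    """Trace tensor shapes through one self-attention block (illustrative helper)."""
    assert D % H == 0, "D must be divisible by the number of heads"
    Dh = D // H
    shapes = {
        "X": (B, S, D),
        "Q_proj": (B, S, D),        # X @ Wq, with Wq shaped (D, D)
        "Q_heads": (B, H, S, Dh),   # split D into H heads of width Dh
        "K_T": (B, H, Dh, S),       # K transposed on its last two axes
        "scores": (B, H, S, S),     # Q_heads @ K_T
        "output": (B, S, D),        # heads merged back before the FFN
    }
    score_bytes = B * H * S * S * bytes_per_elem
    return shapes, score_bytes


# Assumed example sizes: batch 1, 32 heads, model width 4096, 2-byte elements.
for S in (2048, 8192):
    _, score_bytes = attention_shapes(B=1, S=S, D=4096, H=32)
    print(f"S={S}: score matrix = {score_bytes / 2**30:.2f} GiB")
```

Under these assumptions, going from S=2048 to S=8192 grows the materialized score tensor from 0.25 GiB to 4.00 GiB per layer, a 16x increase for a 4x longer sequence.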
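Self-attention projections contribute roughly 4D^2 parameters per layer: Q, K, V, and the output projection. A two-matrix FFN contributes about 2 * D * Dff; if Dff=4D, that is about 8D^2. Quick estimates summarize the FFN term as mD^2 (m=8 in this case), so one layer is about (4+m)D^2, with embeddings adding V*D once for the whole model.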
For a rough estimate, plug values such as D=4096,L=32,V=32000,m=8 into the calculator below, then compare the result with the memory ledger rather than treating parameter count as total memory.
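The same estimate can be written as a few lines of Python. This is a minimal sketch of the rough formula only: the function name `estimate_params` is hypothetical, and biases, normalization weights, and any untied output head are ignored.

```python
def estimate_params(D, L, V, m=8):
    """Rough count: L layers of (attention 4*D^2 + FFN m*D^2) plus a V*D embedding."""
    attention = 4 * D * D
    ffn = m * D * D
    embedding = V * D
    return L * (attention + ffn) + embedding


total = estimate_params(D=4096, L=32, V=32000, m=8)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# prints: 6,573,522,944 parameters (~6.57B)
```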
| Phase | Major resident state | What grows memory | Matching optimization |
|---|---|---|---|
| Training | Weights, gradients, optimizer states such as Adam m and v, activations, and temporary buffers. | Model state, batch size, sequence length, and saved activations. | ZeRO/FSDP, activation checkpointing, and memory-efficient attention. |
| Inference | Weights, runtime buffers, and KV Cache. | Concurrent sequences and retained context length. | Quantization, paging, cache-aware scheduling, and smaller KV representations. |
Do not mix these into one vague "memory" number. A useful memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache.
Attention activations grow with sequence length, while KV Cache grows with retained context during serving. That difference is why training and inference memory fixes are not interchangeable.
Transformer parameter count -> training memory ledger -> single-GPU memory failure -> ZeRO/FSDP partitions parameters, gradients, and optimizer states. Attention's S x S intermediate -> HBM writes and activation pressure -> FlashAttention uses blockwise online softmax to avoid materializing the full score matrix. Inference weights + KV Cache -> PagedAttention and KV quantization become useful serving tools.
Name each memory owner first: weights, activations, gradients, optimizer states, temporary buffers, communication buckets, and KV Cache. Then explain which optimization shards, shrinks, moves, or recomputes that state.
Do not say every memory fix solves the same problem. FlashAttention targets attention intermediates, ZeRO/FSDP shards training state, activation checkpointing recomputes activations, and KV Cache work mainly affects inference serving.
Use these quick estimates to keep parameter count, communication traffic, and training memory in separate boxes.
The numbers are intentionally rough. They are good for interview reasoning and sanity checks, not for replacing profiler traces or framework memory summaries.
Estimate each layer as attention 4D^2, FFN mD^2, plus embedding V*D.
Approximate ring all-reduce traffic per rank as 2(N-1)/N * data, then compare fast and slow interconnect assumptions.
Estimate persistent training state: weights, gradients, and optimizer states. Runtime peaks can still be higher.
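A minimal sketch of the persistent-state estimate, kept as separate ledger lines. The byte counts are assumptions for one common mixed-precision Adam setup (2-byte weights, 2-byte gradients, and 12 bytes per parameter for FP32 master weights plus Adam m and v); other setups use different values, and the function name is hypothetical.

```python
def persistent_training_bytes(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Persistent training state per replica, one ledger line per memory owner."""
    return {
        "weights": num_params * weight_bytes,
        "gradients": num_params * grad_bytes,
        "optimizer_states": num_params * optim_bytes,
    }


# Assumed model size: 7 billion parameters.
ledger = persistent_training_bytes(num_params=7_000_000_000)
for owner, size in ledger.items():
    print(f"{owner:>16}: {size / 2**30:6.1f} GiB")
print(f"{'total':>16}: {sum(ledger.values()) / 2**30:6.1f} GiB")
```

Under these assumptions the persistent state is 16 bytes per parameter, before counting activations, temporary buffers, or communication buckets.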
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Start from where bytes live and how often they move. A kernel can have plenty of math available and still be slow if it repeatedly reads from HBM or writes large intermediates.
| Mechanism | Main idea | Concrete takeaway |
|---|---|---|
| Coalesced memory access | Arrange neighboring threads to read neighboring addresses. | Inspect address layout before blaming arithmetic throughput. |
| Shared memory | Stage reused data close to the thread block. | Reuse a loaded tile before returning to HBM. |
| Bank conflict | Avoid many threads contending for one shared-memory bank. | Padding tile[32][32] to tile[32][33] is the classic transpose fix. |
| Reduce implementations | Move from naive dependencies to shared-memory reuse and then warp-level synchronization. | Compare synchronization cost as the reduction scope shrinks. |
| GEMM tiling | Reuse matrix tiles before returning to HBM. | Increase useful arithmetic per byte fetched. |
| Softmax fusion | Keep intermediate values close to the kernel. | Avoid materializing extra tensors when numerics and shapes allow it. |
Naive attention materializes the (B,H,S,S) score matrix. FlashAttention-style kernels compute attention in blocks with online softmax so the full score matrix does not need to be written to HBM.
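The online-softmax idea can be checked on a single query row in plain Python. This is a toy sketch, not a FlashAttention kernel: values are scalars instead of vectors, the block size is arbitrary, and both function names are hypothetical. It only shows that a running max, running sum, and rescaled accumulator reproduce the result of materializing every weight.

```python
import math


def naive_attention_row(scores, values):
    """Softmax-weighted sum that materializes every weight first."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def online_attention_row(scores, values, block_size=4):
    """Same result one block at a time, keeping only a running max, sum, and accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale earlier partial results to the new max before adding this block.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0, -2.5, 0.75]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5]
difference = abs(naive_attention_row(scores, values) - online_attention_row(scores, values))
print(difference < 1e-9)
```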
| Optimization point | Problem solved | Mechanism | What to measure | Common misunderstanding |
|---|---|---|---|---|
| Coalescing | Uncoalesced global reads waste memory transactions. | Make adjacent threads access adjacent addresses. | global load/store efficiency, memory throughput | Coalescing is about address pattern, not only total bytes. |
| Shared memory tiling | Repeated HBM reads dominate arithmetic. | Load tiles once, reuse them inside the block. | HBM throughput, L2 hit, occupancy | Shared memory can also lower occupancy if tile size is too large. |
| Padding transpose | Shared memory bank conflicts serialize access. | Pad the layout, for example 32x33, to shift bank mapping. | shared bank conflict metric | Padding fixes one access pattern, not every layout problem. |
| Kernel fusion | Intermediate tensors create extra reads, writes, and launch overhead. | Combine adjacent operations into one kernel when shapes and numerics allow it. | kernel count, HBM writes, launch overhead | Fusion can increase register pressure or reduce reuse if applied blindly. |
| FlashAttention | S x S attention memory and HBM traffic | blockwise Q/K/V + online softmax | memory peak, attention kernel time, tokens/s | It is an attention algorithm/kernel strategy, not a new model architecture. |
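
The padding row above can be verified with a small bank-index calculation. This sketch assumes 32 shared-memory banks and 4-byte tile elements, and models a warp of 32 threads reading one column of a row-major tile; the function name is hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each serving one 4-byte word per access


def worst_conflict(row_stride, column=0, rows=32):
    """Largest number of threads that hit the same bank when a warp reads one tile column."""
    # Thread r reads tile[r][column], which sits at word offset r * row_stride + column.
    banks = Counter((r * row_stride + column) % NUM_BANKS for r in range(rows))
    return max(banks.values())


print(worst_conflict(row_stride=32))  # tile[32][32] -> 32
print(worst_conflict(row_stride=33))  # tile[32][33] -> 1
```

With a row stride of 32 words every thread in the column read lands in the same bank; one word of padding per row shifts each row to a different bank. The fix is specific to this access pattern and element size.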
Compare achieved memory bandwidth, Tensor Core utilization, occupancy, and instruction mix. If memory throughput is high while compute utilization is low, the next question is layout, tiling, fusion, or reducing bytes moved.
Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
DDP keeps a full model replica on each rank, feeds each rank a different data shard, and all-reduces gradient buckets during backward so all replicas apply the same update.
The user code looks local, but DDP installs autograd hooks. When gradients are produced, those hooks bucket the gradients and launch collectives through the process group.
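import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# build_model, train_dataset, move_to_cuda, batch_size, and num_epochs come from the surrounding script.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
model = DDP(build_model().cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())  # assumed optimizer; the original omits its construction
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reseed the shuffle so each epoch sees a different order
    for batch in loader:
        loss = model(**move_to_cuda(batch)).loss
        loss.backward()  # autograd hooks trigger gradient bucket all-reduce
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
dist.destroy_process_group()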
ZeRO and FSDP reduce per-GPU memory by sharding optimizer states, gradients, and parameters at different levels. That saves memory but adds all-gather, reduce-scatter, checkpoint, and runtime peak-memory considerations.
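A rough sketch of how each sharding level changes persistent per-GPU state. It assumes the same mixed-precision Adam byte counts as a typical setup (2-byte weights, 2-byte gradients, 12 bytes of optimizer state per parameter) and ignores activations, all-gather peaks, and buffers. The function name and the 7B / 8-GPU example are hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage,
                       weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Persistent bytes per GPU for ZeRO stages 0 (plain DDP) through 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    if stage >= 1:
        optim /= num_gpus    # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 / FSDP full sharding also shards parameters
    return weights + grads + optim


for stage in range(4):
    gib = zero_bytes_per_gpu(7_000_000_000, num_gpus=8, stage=stage) / 2**30
    print(f"stage {stage}: {gib:.1f} GiB per GPU")
```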
| Method | What is partitioned | Key communication | What it saves | What it costs | What to measure |
|---|---|---|---|---|---|
| DDP | Input data only; parameters are replicated. | gradient all-reduce | Throughput via more workers | Full model and optimizer memory per GPU | step time, all-reduce overlap, samples/s |
| ZeRO-1 | optimizer states | gradient reduction, then all-gather of updated parameters | Adam states memory | More state movement and optimizer complexity | per-GPU optimizer memory, step overhead |
| ZeRO-2 | optimizer states + gradients | reduce-scatter / all-gather | gradients and Adam states memory | More collective traffic on the backward path | gradient memory, NCCL time |
| ZeRO-3 / FSDP | parameters + gradients + optimizer states | parameter all-gather, gradient reduce-scatter | Most persistent training state memory | Parameter materialization peaks and shard checkpoint complexity | peak memory, gather time, shard checkpoint cost |
| Parallelism | What it splits or routes | Primary systems cost |
|---|---|---|
| Tensor Parallel (TP) | Layer-internal matrix work. | High-frequency collectives; keep within a fast interconnect domain. |
| Pipeline Parallel (PP) | Model layers into stages. | Pipeline bubbles and stage balancing. |
| Data Parallel (DP) | Input batches while replicating or sharding model state. | Gradient or state synchronization. |
| Sequence / Context Parallelism | Sequence or context work. | Attention communication while reducing activation memory. |
| Expert Parallelism | Tokens routed across expert owners. | All-to-all traffic and load balancing. |
| 3D Parallelism | Multiple parallel axes together. | Topology-aware mapping of every collective. |
Keep high-frequency collectives such as tensor-parallel all-reduce inside NVLink when possible, map slower data-parallel communication across nodes, and explain how the placement changes failure domains and checkpoint traffic.
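One way to check a placement before launching is to build the rank groups and test whether every tensor-parallel group stays on one node. This is a sketch of one hypothetical layout: it assumes the launcher assigns consecutive global ranks to GPUs on the same node, and the function name and sizes are illustrative.

```python
def build_groups(world_size, tp_size, gpus_per_node):
    """Group ranks so tensor-parallel peers are consecutive ranks (hypothetical layout)."""
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # Consecutive ranks form a TP group; ranks with the same TP index form a DP group.
    tp_groups = [list(range(g * tp_size, (g + 1) * tp_size)) for g in range(dp_size)]
    dp_groups = [list(range(i, world_size, tp_size)) for i in range(tp_size)]

    def node_of(rank):
        return rank // gpus_per_node

    tp_within_node = all(len({node_of(r) for r in group}) == 1 for group in tp_groups)
    return tp_groups, dp_groups, tp_within_node


tp_groups, dp_groups, ok = build_groups(world_size=64, tp_size=8, gpus_per_node=8)
print(tp_groups[0])  # [0, 1, 2, 3, 4, 5, 6, 7]: one node
print(dp_groups[0])  # [0, 8, 16, 24, 32, 40, 48, 56]: one rank per node
print(ok)            # True
```

With `tp_size=16` on the same 8-GPU nodes the check returns False, which signals that the hottest collective would cross the node boundary.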
Adding GPUs is not automatically a data-parallel speedup. The real design choice is which state is replicated, which state is sharded, and which collective or all-to-all exchange lands on the critical path.
Ring AllReduce per-GPU traffic: 2 * (N - 1) / N * data_size
Bandwidth vs latency: large buckets are usually bandwidth-oriented, while many small collectives can be dominated by launch, synchronization, and topology latency.
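The traffic formula above turns into a bandwidth-only time estimate with one division. This sketch deliberately ignores latency, launch overhead, and compute overlap, so it is a lower bound for large buckets only. The gradient size and both link bandwidths are placeholder assumptions, not measurements of any specific hardware.

```python
def ring_allreduce_bytes_per_gpu(data_bytes, num_gpus):
    """Bytes each GPU sends in a ring all-reduce: 2 * (N - 1) / N * data_size."""
    return 2 * (num_gpus - 1) / num_gpus * data_bytes


def bandwidth_only_seconds(data_bytes, num_gpus, link_bytes_per_s):
    """Lower bound on time that ignores latency, launch overhead, and overlap."""
    return ring_allreduce_bytes_per_gpu(data_bytes, num_gpus) / link_bytes_per_s


grad_bytes = 14e9  # assumed: about 7B gradients at 2 bytes each
for label, bandwidth in (("fast link", 300e9), ("slow link", 25e9)):  # assumed bytes/s
    t = bandwidth_only_seconds(grad_bytes, num_gpus=8, link_bytes_per_s=bandwidth)
    print(f"{label}: {t * 1e3:.0f} ms")
```

Because the factor 2(N-1)/N approaches 2 as N grows, per-GPU traffic is nearly independent of the GPU count; the link each hop crosses matters more.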
nvidia-smi topo -m
How to read it
Topology labels tell you whether traffic stays on fast GPU links or crosses PCIe, host bridges, NUMA boundaries, or the system interconnect. Use them to place ranks, not just to describe hardware.
| Label | Intuition | Placement implication |
|---|---|---|
| NV# | GPUs are connected by a bonded set of # NVLink links. | Keep tensor-parallel or high-frequency collectives here when possible. |
| PIX | Traffic goes through at most a single PCIe bridge or switch. | Acceptable for moderate traffic, but weaker than NVLink. |
| PHB | Traffic crosses a PCIe host bridge, typically the CPU. | Avoid placing the hottest collectives across this path. |
| NODE | Traffic crosses the interconnect between PCIe host bridges within one NUMA node. | Bind CPU, NIC, and GPU placement deliberately. |
| SYS | Traffic crosses the inter-socket interconnect between NUMA nodes. | Treat it as a slow path for frequent GPU-GPU communication. |
First decide whether the wait is compute, memory, communication, or orchestration. Then tie the diagnosis to a trace, counter, or controlled experiment.
| Bottleneck | Signal | Typical cause | Next experiment |
|---|---|---|---|
| Compute-bound | Tensor Core utilization is high. | dtype, tile shape, kernel library, or batch shape | Change dtype, kernel backend, or problem shape. |
| Memory-bound | HBM throughput high, compute utilization low. | Excessive reads/writes or poor locality | Try fusion, tiling, FlashAttention, quantization, or layout changes. |
| Communication-bound | NCCL ranges dominate the step timeline. | Large buckets, slow topology path, or poor overlap | Change rank placement, bucket size, overlap settings, or parallelism axes. |
A good performance explanation names the slow path, shows the evidence, and proposes one controlled experiment.
| Tool | Question it answers | Evidence scope |
|---|---|---|
| Nsight Systems | Where does elapsed time go? | CPU work, CUDA kernels, memory copies, and NCCL on a timeline. |
| Nsight Compute | Why does one kernel behave this way? | Occupancy, instruction mix, memory throughput, and bank conflicts. |
| PyTorch Profiler | Which framework operation created the work? | Operator-to-kernel attribution and high-level traces. |
Roofline reasoning asks whether arithmetic intensity is high enough to use available compute. If it is not, optimizing memory movement may matter more than adding more math throughput.
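A minimal roofline check compares a kernel's arithmetic intensity with the machine balance. The peak numbers below are placeholders rather than the specification of any real GPU, the byte counts assume ideal reuse, and the function name is hypothetical.

```python
def roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel by comparing arithmetic intensity with machine balance."""
    intensity = flops / bytes_moved                 # FLOPs per byte moved
    balance = peak_flops_per_s / peak_bytes_per_s   # ridge point of the roofline
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    bound = "compute-bound" if intensity >= balance else "memory-bound"
    return intensity, attainable, bound


# Assumed device: 300 TFLOP/s peak math and 2 TB/s HBM bandwidth (placeholders).
PEAK_FLOPS, PEAK_BW = 300e12, 2e12

# Elementwise add over n FP16 values: n FLOPs, 3 * n * 2 bytes (two reads, one write).
n = 1 << 24
print(roofline(n, 3 * n * 2, PEAK_FLOPS, PEAK_BW)[2])               # memory-bound

# Square FP16 GEMM with ideal reuse: 2 * N^3 FLOPs, 3 * N^2 * 2 bytes.
N = 4096
print(roofline(2 * N**3, 3 * N * N * 2, PEAK_FLOPS, PEAK_BW)[2])    # compute-bound
```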
| Symptom | Likely cause | Tool evidence | Next move |
|---|---|---|---|
| GPU gaps between kernels | dataloader, tokenizer, CPU sync, H2D, or launch overhead | Nsight Systems CPU/GPU timeline | Fix input pipeline or remove synchronization. |
| NCCL dominates step time | communication-bound parallel layout | nsys trace + topology mapping | Revisit rank placement, overlap, and bucket size. |
| Memory throughput near peak | HBM-bound kernel or extra tensor materialization | Nsight Compute memory metrics | Try tiling, fusion, FlashAttention, quantization, or layout changes. |
| Shared memory conflict high | bank conflict or poor tile layout | NCU shared memory bank conflict | Pad or reshape shared-memory tiles. |
| Serving P99 high | queueing, long prompts, decode pressure, or KV Cache fragmentation | request traces + scheduler metrics | Separate prefill and decode metrics before tuning batching. |
torchrun --nproc-per-node=8 train.py --config config.yaml
nsys profile --trace=cuda,nvtx,osrt --output=step_report python train.py
ncu --set full --target-processes all python kernel_bench.py
# Serving comparison: keep prompt/output length distributions fixed before comparing engines.
Serving work is not one uniform forward pass. Separate prompt ingestion from token-by-token generation before reasoning about throughput, latency, or cache capacity.
A request first enters a scheduler or queue, then prefill computes the prompt and writes KV Cache. Decode repeatedly reads that cache, appends one token step, and streams output while the scheduler admits more work.
| Phase | State transition | Dominant metric | Typical pressure |
|---|---|---|---|
| Prefill | Processes prompt tokens and builds the initial KV Cache. | TTFT | Compute and attention work for long prompts. |
| Decode | Reads and appends KV Cache one generated token at a time. | TPOT | Cache capacity, memory bandwidth, and scheduling under load. |
KV Cache memory scales with batch * layers * kv_heads * head_dim * sequence_length * 2(K,V) * bytes. That is why request length distribution and cache management are first-order serving concerns.
2 * layers * batch * seq_len * kv_heads * head_dim * bytes
2 counts key and value tensors.
batch and seq_len capture concurrent requests and cached context length.
kv_heads can be smaller than attention heads for MQA/GQA models.
bytes depends on FP16/BF16 or cache quantization support.
request arrives
-> scheduler / queue
-> prefill builds KV Cache and determines TTFT
-> decode loop reads KV Cache one step at a time
-> stream output tokens and track TPOT, P95/P99
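The KV Cache formula above is easy to evaluate directly. This sketch uses an assumed configuration (32 layers, head_dim 128, 16 concurrent sequences at 4096 cached tokens, 2-byte elements) to compare full multi-head KV storage with a grouped-query layout; the function name is hypothetical.

```python
def kv_cache_bytes(layers, batch, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """2 * layers * batch * seq_len * kv_heads * head_dim * bytes (the 2 is K and V)."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_elem


# Assumed config: 32 layers, head_dim 128, 16 concurrent sequences at 4096 tokens, FP16.
common = dict(layers=32, batch=16, seq_len=4096, head_dim=128)
mha = kv_cache_bytes(kv_heads=32, **common)  # every attention head keeps its own K/V
gqa = kv_cache_bytes(kv_heads=8, **common)   # grouped-query attention with 8 KV heads
print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB")
# prints: MHA: 32 GiB, GQA: 8 GiB
```

The result scales linearly in both batch and seq_len, which is why request length distribution shows up directly in cache capacity planning.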
PagedAttention-style cache management reduces waste from variable-length requests. Continuous batching improves GPU utilization by mixing requests at different decode positions, but it can increase queueing complexity and tail-latency risk.
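The waste reduction can be illustrated by counting unused token slots under two allocation policies. This is a simplified model, not the allocator of any particular serving engine: the block size of 16 tokens, the maximum length, and the request lengths are all assumptions.

```python
import math


def wasted_slots_contiguous(lengths, max_len):
    """Token slots reserved but unused when every request preallocates max_len."""
    return sum(max_len - n for n in lengths)


def wasted_slots_paged(lengths, block_size):
    """Unused slots when each request only holds ceil(n / block_size) fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)


lengths = [37, 512, 1999, 130, 64]  # assumed retained context lengths, in tokens
print(wasted_slots_contiguous(lengths, max_len=2048))  # 7498
print(wasted_slots_paged(lengths, block_size=16))      # 26
```

In the paged model, waste is bounded by less than one block per request, regardless of how far a request falls short of the maximum length.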
Quantization can reduce memory and bandwidth pressure when kernels support the target dtype and accuracy remains acceptable. Speculative decoding helps only when acceptance rate, draft-model cost, and scheduler overhead make the extra machinery worthwhile.
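A back-of-the-envelope model shows how acceptance rate and draft cost interact. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per verification pass, and that one draft step costs `draft_cost_ratio` of a target step. Scheduler overhead is ignored, and real acceptance is not independent across tokens, so treat the output as a rough guide only.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass, assuming independent acceptance."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Speedup over plain decoding when one draft step costs draft_cost_ratio target steps."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)


# Assumed settings: 4 drafted tokens per pass, draft model at 10% of the target's step cost.
for alpha in (0.0, 0.5, 0.8):
    speedup = estimated_speedup(alpha, gamma=4, draft_cost_ratio=0.1)
    print(f"acceptance {alpha:.1f}: estimated speedup {speedup:.2f}x")
```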
| Metric | Meaning | What it tells you |
|---|---|---|
| TTFT | Time To First Token | Prefill, queueing, and admission behavior. |
| TPOT | Time Per Output Token | Decode loop cost and cache-read pressure. |
| QPS | Requests served per second | Admission capacity under a specific traffic mix. |
| tokens/s | Generated tokens per second | Throughput, but only meaningful with prompt/output length distributions. |
| P95 / P99 | Tail latency percentiles | Queueing, long-context pressure, or scheduling unfairness. |
| System | Useful mental model | Check before claiming |
|---|---|---|
| vLLM | PagedAttention, continuous batching, and serving scheduler behavior. | Version, backend, supported models, and workload shape. |
| SGLang | Structured generation, runtime scheduling, and serving orchestration. | Frontend language features, backend engine, and cache behavior. |
| TensorRT-LLM | Optimized inference runtime with engine build and kernel choices. | Hardware, dtype, quantization path, and engine build constraints. |
Separate queue time, prefill time, decode TPOT, and KV Cache pressure. Then say whether the evidence points to scheduling, memory bandwidth, cache fragmentation, or request-shape skew.
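The separation described above can be computed from per-request timestamps. The trace rows and field order below are hypothetical; a real serving stack would export these from its own request logs. The percentile uses the simple nearest-rank definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def request_metrics(arrival, prefill_start, first_token, last_token, output_tokens):
    """Split one request's latency into queue time, prefill time, TTFT, and TPOT (seconds)."""
    decode_steps = max(output_tokens - 1, 1)
    return {
        "queue": prefill_start - arrival,
        "prefill": first_token - prefill_start,
        "ttft": first_token - arrival,
        "tpot": (last_token - first_token) / decode_steps,
    }


# Hypothetical trace rows: (arrival, prefill_start, first_token, last_token, output_tokens).
trace = [
    (0.00, 0.01, 0.21, 2.21, 101),
    (0.05, 0.30, 0.55, 3.05, 51),
    (0.10, 0.90, 1.40, 2.40, 21),
]
metrics = [request_metrics(*row) for row in trace]
print("P99 TTFT:", percentile([m["ttft"] for m in metrics], 99))
print("max queue:", max(m["queue"] for m in metrics))
```

In this made-up trace the slowest TTFT comes mostly from queue time rather than prefill, which points at scheduling or admission instead of prompt length.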
Use this matrix to avoid generic claims. Each row should connect a concrete bottleneck to a mechanism, cost, and measurement.
The most common mistake is naming a technique without naming the state object or resource it changes.
| Technique | Problem | Mechanism | Saves what | Costs what | When to use | When not to use | What to measure |
|---|---|---|---|---|---|---|---|
| ZeRO / FSDP | Training state does not fit per GPU. | Shard parameters, gradients, and/or optimizer states. | Persistent training memory | Extra collectives, materialization peaks, checkpoint complexity | Large training jobs where memory is the limit | Small models where communication dominates | peak memory, NCCL time, step time |
| Activation checkpointing | Activations dominate training memory. | Drop selected activations and recompute them in backward. | activation memory | extra compute and longer step time | Memory-bound training with spare compute | Compute-bound training already near the time budget | peak memory, step time, recompute overhead |
| FlashAttention | Attention S x S intermediates and HBM traffic are large. | blockwise Q/K/V + online softmax | HBM traffic, attention memory | backend constraints and kernel compatibility | Long context or attention-heavy workloads | Unsupported masks, dtypes, layouts, or hardware | attention time, memory peak, tokens/s |
| Tensor Parallel | One layer's matrix work is too large for one GPU. | Split layer-internal computation across ranks. | per-GPU compute and parameter load | high-frequency collectives | Fast intra-node interconnect such as NVLink | Slow cross-node paths for small layers | NCCL time per layer, MFU, step time |
| Pipeline Parallel | Model depth does not fit or scale well on one device group. | Split layers into pipeline stages. | per-stage model memory | bubble, stage imbalance, activation buffering | Very deep models with balanced stages | Small batch or uneven stage timing | bubble ratio, stage time, utilization |
| SP / CP | Sequence or context memory is too large. | Split sequence/context dimension across ranks. | activation/KV/context memory | all-to-all/gather, mask and kernel complexity | Long-context training or inference | Short sequences where overhead dominates | memory peak, collective time, correctness tests |
| Expert Parallel | MoE experts increase capacity but cannot all run on every rank. | Route tokens to experts placed on different ranks. | expert parameter and compute placement | all-to-all, load imbalance, capacity drops | MoE models with enough token volume | Highly skewed routing without capacity control | all-to-all time, expert load, dropped tokens |
| Quantization | Weights or KV Cache stress memory and bandwidth. | Store or compute with fewer bits when kernels support it. | memory, HBM bandwidth | accuracy risk, calibration, kernel support | Serving workloads limited by memory traffic | Quality-sensitive paths without validation | quality eval, tokens/s, TPOT, memory |
| Continuous batching | GPU idles while requests arrive and finish at different times. | Admit new requests into ongoing decode batches. | GPU idle time, goodput | scheduling complexity and queueing variance | Mixed online serving traffic | Strict isolation or very predictable single-request traffic | QPS, TPOT, P95/P99, queue time |
| Speculative decoding | Decode is slow one token at a time. | Draft tokens with a cheaper model and verify with the target model. | wall-clock decode time | draft-model cost, acceptance-rate sensitivity | High acceptance rate with cheap draft model | Low acceptance rate or tight quality constraints | acceptance rate, TPOT, quality |
| Triton custom kernel | Framework kernels leave a clear hot path. | Write a workload-specific fused or tiled kernel. | kernel launches, HBM traffic, overhead | maintenance and correctness burden | Stable hot workload with measurable headroom | Rapidly changing shapes or fragile numerics | kernel time, correctness, maintenance cost |
Use these prompts to turn mechanisms into interview-ready explanations.
Short answer: start from X(B,S,D), project Q/K/V, split heads to (B,H,S,D/H), form scores (B,H,S,S), then return to (B,S,D).
Deeper explanation: the shape trace tells you where attention memory grows, where FFN parameters live, and which axis a parallel strategy would split.
Pitfall: do not treat the S x S score matrix as a permanent model parameter; it is an activation/intermediate.
Follow-up: connect the shape trace to FlashAttention and KV Cache memory.
mD^2?
Short answer: a two-layer FFN has roughly 2 * D * Dff parameters; when Dff is a multiple of D, the count is summarized as mD^2.
Deeper explanation: this estimate helps separate parameter memory from activation memory and KV Cache memory.
Pitfall: do not claim parameter count alone predicts training memory; optimizer states and activations can dominate.
Follow-up: explain how embedding size V*D changes the total.
Short answer: list weights, gradients, optimizer states, activations, temporary buffers, and communication buckets separately.
Deeper explanation: then mark which items are persistent, which appear only at runtime peaks, and which are replicated or sharded across ranks.
Pitfall: do not hide activation checkpointing, ZeRO/FSDP, and quantization under one generic "memory optimization" label.
Follow-up: estimate which owner dominates before choosing an optimization.
Short answer: DDP replicates parameters and synchronizes gradients; FSDP shards parameters, gradients, and optimizer state around module execution.
Deeper explanation: FSDP saves memory by materializing parameter shards when needed and reducing/scattering gradients during backward.
Pitfall: FSDP can still have runtime peak memory during all-gather or prefetch windows.
Follow-up: name the all-gather and reduce-scatter points in the step timeline.
Short answer: map the most frequent communication to the fastest links, usually keeping tensor parallelism inside a fast node or NVLink island.
Deeper explanation: choose TP, PP, DP/FSDP, context parallelism, or expert parallelism by naming the state and axis being split.
Pitfall: do not pick degrees before checking topology and bucket/activation traffic.
Follow-up: explain which collectives cross nodes and how that affects failure recovery.
Short answer: state the claim, hold the workload fixed, measure the relevant bottleneck, and change one variable.
Deeper explanation: a good experiment says whether the result is limited by memory bandwidth, compute, communication, scheduler behavior, or request mix.
Pitfall: do not compare tokens/s across different prompt/output distributions and call it a serving win.
Follow-up: show the counter or trace that would falsify your explanation.
Short answer: prefill handles the prompt and drives TTFT; decode repeatedly reads KV Cache and drives TPOT.
Deeper explanation: batching helps utilization, but longer prompts and retained context increase KV Cache pressure and tail latency.
Pitfall: do not explain serving latency with one average latency number.
Follow-up: separate queue time, prefill time, decode TPOT, and P95/P99.
Short answer: the kernel is limited by how fast bytes move rather than by math throughput, so improving layout, reuse, or fusion matters more than adding FLOPs.
Deeper explanation: point to HBM throughput, low compute utilization, memory transactions, or shared-memory conflicts.
Pitfall: do not call everything memory-bound just because memory is large; use profiler counters.
Follow-up: suggest one experiment: tiling, fusion, quantization, or FlashAttention depending on the workload.
Use these representative prompts to rehearse mechanisms and tradeoffs. The full Q&A lives in the interview section so this handbook stays concept-first.
Continue from core training and serving mechanisms into specialized runtime domains.
Rollout production, policy freshness, placement, distributed checkpointing, and recovery.
Stage orchestration, streamed outputs, heterogeneous batching, and intermediate ownership.
RAII, bounded queues, backpressure, transports, NCCL errors, and timeout evidence.
Executable exercises and calculators for memory, communication, latency, and rollout capacity.
Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
| Lab | Concept | Code reading focus | Key mechanism | Interview takeaway |
|---|---|---|---|---|
| 01 | Transformer Memory Accounting | Formula script for params, Adam states and KV Cache | Config values become a memory ledger | Training memory and inference memory have different dominant terms |
| 02 | Single GPU Training Loop | Minimal PyTorch forward/backward/optimizer flow | Activations, gradients, optimizer state and checkpoint lifecycle | A training step is a resource lifecycle, not just a loss update |
| 03 | DDP Conversion | torchrun, rank, process group and DDP wrapper | Gradient sync is hidden behind backward hooks and buckets | DDP replicates models and synchronizes gradients; it does not shard memory |
| 04 | CUDA Reduce Optimization | Atomic and shared-memory reduction kernels | Block/thread cooperation turns global contention into staged aggregation | Reduction performance is about memory traffic, contention and synchronization |
| 05 | Shared Memory Bank Conflict | Transpose tiles with and without padding | Tile shape changes shared-memory bank mapping | tile[32][33] is a mechanism example, not a magic constant |
| 06 | Triton Fused Softmax | tl.load, masks, reductions and tl.store | One program handles a block/row and avoids intermediate HBM writes | Fusion helps when it removes real memory traffic or launch overhead |
| 07 | FlashAttention Mental Model | Naive attention and online softmax walkthrough | Blockwise attention avoids materializing S x S | FlashAttention is exact attention reordered for IO awareness |
| 08 | ZeRO / FSDP Memory Sharding | Memory formulas for replicated vs sharded states | State sharding trades memory for all-gather/reduce-scatter | Higher sharding stages reduce persistent state but add communication peaks |
| 09 | Nsight Profiling Workflow | Command anatomy and profiling checklist | Systems timeline first, kernel counters second, hypothesis always | Profiling is evidence collection for one causal question at a time |
| 10 | vLLM Serving Workload Config | Serving workload YAML and metric vocabulary | Prefill, decode, concurrency and KV Cache shape metrics | TTFT, TPOT, QPS and tokens/s only make sense with length distribution |
| 11 | Quantization Comparison | Comparison plan for precision choices | Lower precision saves storage/bandwidth only when kernels and quality allow it | Always state what is quantized: weights, activations or KV Cache |
| 12 | 64-GPU Parallelism Design | Topology worksheet for TP/PP/DP/FSDP placement | Parallel axes must be mapped to fast and slow communication paths | A good answer explains communication hot paths, not only degrees |
Use official docs and papers for API behavior and version-sensitive claims; use blogs only to improve intuition.