AI Infra Umbrella

AI Infra Systems Map

The umbrella track for compute, memory, communication, profiling, serving, and tradeoffs. Distributed Training and Transformer Systems are deep-dive branches of this track.

What this track is

Parent systems map

Use this page to connect GPU kernels, memory accounting, distributed training, profiling, serving, quantization, and topology decisions into one AI systems fieldbook.

Read this page in passes. First build the main thread: why Transformers create parameter, activation, and KV Cache pressure; why GPUs are often limited by HBM traffic; why distributed training needs collectives; and why inference serving becomes a scheduling and cache-management problem. On the second pass, restate each concept as problem, mechanism, savings, cost, and measurement.

Three main threads
| Thread | Core pressure | Question to answer |
| --- | --- | --- |
| Compute | Kernels are often limited by memory movement, not only FLOPs. | Which bytes can be reused, fused, or avoided? |
| Memory | Training state and serving-time KV Cache need separate accounting. | Which state grows, and which optimization changes it? |
| Communication | Distributed training is shaped by rank ownership and critical-path collectives. | Which collective crosses which topology boundary? |

What training focuses on

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

What inference focuses on

Inference focuses on prefill, decode, batching, KV Cache capacity, scheduler behavior, and tail latency under real request distributions.

Suggested learning order

  1. Start with Transformer shapes, parameter count, activation memory, and KV Cache memory.
  2. Then learn GPU memory hierarchy: HBM, L2, shared memory, registers, coalescing, bank conflicts, tiling, and fusion.
  3. Move to DDP, ZeRO/FSDP, tensor/pipeline/context/expert parallelism, and NCCL collectives.
  4. Use profiling to decide whether the bottleneck is CPU orchestration, kernel math, memory traffic, or communication.
  5. Finally move to serving: prefill, decode, KV Cache, PagedAttention, continuous batching, quantization, and serving metrics.

Foundation

Transformer and Memory Accounting

Concept explanation: decoder-only block data flow

For input X: (B,S,D), the block projects Q=XWq, K=XWk, and V=XWv, splits them across H heads with Dh=D/H, and computes attention scores shaped (B,H,S,S). The output returns to (B,S,D) before the FFN and residual path.

Q/K/V shape: X(B,S,D) @ Wq(D,D) -> Q(B,S,D) -> Q(B,H,S,D/H)

Attention score: Q(B,H,S,D/H) @ K^T(B,H,D/H,S) -> (B,H,S,S)

Connection: the S x S score matrix explains why long context stresses attention memory and why FlashAttention-style kernels matter.
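The shape trace above can be written down as a small helper. This is a minimal sketch rather than a model implementation: the function names and the 2-byte element size are assumptions chosen for illustration, and only shapes are tracked, not values.

```python
def trace_attention_shapes(B, S, D, H):
    """Return the tensor shapes produced along one attention block."""
    assert D % H == 0, "D must be divisible by H"
    Dh = D // H
    return {
        "X": (B, S, D),
        "Q_proj": (B, S, D),       # X @ Wq, with Wq shaped (D, D)
        "Q_heads": (B, H, S, Dh),  # split D into H heads of size Dh
        "K_T": (B, H, Dh, S),      # K with its last two axes swapped
        "scores": (B, H, S, S),    # Q_heads @ K_T
        "out": (B, S, D),          # heads merged back before FFN and residual
    }


def score_bytes(B, H, S, bytes_per_elem=2):
    """Bytes needed to materialize the full (B,H,S,S) score matrix."""
    return B * H * S * S * bytes_per_elem


shapes = trace_attention_shapes(B=1, S=4096, D=4096, H=32)
print(shapes["scores"], score_bytes(1, 32, 4096) / 2**30, "GiB")
```

Doubling S quadruples `score_bytes`, which is the quadratic growth that blockwise attention kernels avoid writing to HBM.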

Parameter count mental model

Self-attention projections are roughly 4D^2 per layer: Q, K, V, and output projection. The FFN is usually 2 * D * Dff; if Dff=4D, that is about 8D^2. The FFN is often summarized as mD^2 (m=8 in this case), so one layer is about (4+m)D^2, with embeddings adding V*D for vocabulary size V.

For a rough estimate, plug values such as D=4096,L=32,V=32000,m=8 into the calculator below, then compare the result with the memory ledger rather than treating parameter count as total memory.

Training memory vs inference memory

| Phase | Major resident state | What grows memory | Matching optimization |
| --- | --- | --- | --- |
| Training | Weights, gradients, optimizer states such as Adam m and v, activations, and temporary buffers. | Model state, batch size, sequence length, and saved activations. | ZeRO/FSDP, activation checkpointing, and memory-efficient attention. |
| Inference | Weights, runtime buffers, and KV Cache. | Concurrent sequences and retained context length. | Quantization, paging, cache-aware scheduling, and smaller KV representations. |

Do not mix these into one vague "memory" number. A useful memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache.

Attention activations grow with sequence length, while KV Cache grows with retained context during serving. That difference is why training and inference memory fixes are not interchangeable.

How would you explain the memory ledger in an interview?

Name each memory owner first: weights, activations, gradients, optimizer states, temporary buffers, communication buckets, and KV Cache. Then explain which optimization shards, shrinks, moves, or recomputes that state.

Common misunderstanding

Do not say every memory fix solves the same problem. FlashAttention targets attention intermediates, ZeRO/FSDP shards training state, activation checkpointing recomputes activations, and KV Cache work mainly affects inference serving.

Calculators

Estimation tools

Use these quick estimates to keep parameter count, communication traffic, and training memory in separate boxes.

The numbers are intentionally rough. They are good for interview reasoning and sanity checks, not for replacing profiler traces or framework memory summaries.

Parameter-count estimator

Estimate each layer as attention 4D^2, FFN mD^2, plus embedding V*D.
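A rough version of this estimator might look like the sketch below. The function name and return layout are assumptions for illustration; the example values D=4096, L=32, V=32000, m=8 are the ones suggested earlier on this page. Biases, norm weights, and any untied output head are ignored.

```python
def estimate_params(D, L, V, m=8, attn_factor=4):
    """Rough parameter count: L layers of (attention + FFN) plus embeddings."""
    per_layer = (attn_factor + m) * D * D  # 4D^2 attention + mD^2 FFN
    embedding = V * D                      # token embedding table
    return L * per_layer + embedding, per_layer, embedding


total, per_layer, embedding = estimate_params(D=4096, L=32, V=32000, m=8)
print(f"per layer: {per_layer / 1e6:.0f}M, embedding: {embedding / 1e6:.0f}M, "
      f"total: {total / 1e9:.2f}B")
```

With these inputs the estimate lands near 6.57B parameters, which is a parameter count only and not a memory total.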

Ring AllReduce time estimator

Approximate ring all-reduce traffic per rank as 2(N-1)/N * data, then compare fast and slow interconnect assumptions.
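The estimate can be sketched as a short function. The bandwidth figures below are hypothetical placeholders, not measurements of any specific interconnect, and the optional per-step latency term is an added assumption to show why many small collectives behave differently from one large bucket.

```python
def ring_allreduce_seconds(data_bytes, n_ranks, link_bytes_per_s, step_latency_s=0.0):
    """Approximate ring all-reduce time for one rank.

    Traffic per rank is 2 * (N - 1) / N * data; a ring runs 2 * (N - 1) steps.
    """
    if n_ranks == 1:
        return 0.0
    traffic = 2 * (n_ranks - 1) / n_ranks * data_bytes
    steps = 2 * (n_ranks - 1)
    return traffic / link_bytes_per_s + steps * step_latency_s


grad_bytes = 2e9  # e.g. 1B parameters stored as 2-byte gradients
fast = ring_allreduce_seconds(grad_bytes, n_ranks=8, link_bytes_per_s=300e9)  # hypothetical fast link
slow = ring_allreduce_seconds(grad_bytes, n_ranks=8, link_bytes_per_s=25e9)   # hypothetical slow link
print(f"fast: {fast * 1e3:.1f} ms, slow: {slow * 1e3:.1f} ms")
```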

Training memory ledger estimator

Estimate persistent training state: weights, gradients, and optimizer states. Runtime peaks can still be higher.
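One way to sketch the persistent part of the ledger is shown below. The byte counts are assumptions for a common mixed-precision Adam setup (2-byte weights, 2-byte gradients, and 12 bytes per parameter for an fp32 master copy plus Adam m and v); other recipes will differ, and activations, temporary buffers, and communication buckets are deliberately left out.

```python
def persistent_training_bytes(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Persistent training state per replica, split by owner."""
    ledger = {
        "weights": n_params * weight_bytes,
        "gradients": n_params * grad_bytes,
        "optimizer_states": n_params * optim_bytes,  # assumed: fp32 master + Adam m + v
    }
    ledger["total"] = sum(ledger.values())
    return ledger


ledger = persistent_training_bytes(7_000_000_000)
for owner, size in ledger.items():
    print(f"{owner:>16}: {size / 1e9:.0f} GB")
```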

GPU Kernel and Memory Hierarchy

From HBM to Triton / FlashAttention

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

Memory hierarchy mental model

Start from where bytes live and how often they move. A kernel can have plenty of math available and still be slow if it repeatedly reads from HBM or writes large intermediates.

Register
Per-thread storage, fastest but limited; excessive use can reduce occupancy.
Shared memory
Block-local scratch space used for tiling, reuse, reductions, and avoiding repeated HBM reads.
L2 / cache
On-GPU cache layer that helps reuse across memory operations but is still much slower than registers.
HBM / global memory
Large device memory with high bandwidth, but expensive enough that layout and access pattern dominate many kernels.

Key mechanisms

| Mechanism | Main idea | Concrete takeaway |
| --- | --- | --- |
| Coalesced memory access | Arrange neighboring threads to read neighboring addresses. | Inspect address layout before blaming arithmetic throughput. |
| Shared memory | Stage reused data close to the thread block. | Reuse a loaded tile before returning to HBM. |
| Bank conflict | Avoid many threads contending for one shared-memory bank. | Padding tile[32][32] to tile[32][33] is the classic transpose fix. |
| Reduce implementations | Move from naive dependencies to shared-memory reuse and then warp-level synchronization. | Compare synchronization cost as the reduction scope shrinks. |
| GEMM tiling | Reuse matrix tiles before returning to HBM. | Increase useful arithmetic per byte fetched. |
| Softmax fusion | Keep intermediate values close to the kernel. | Avoid materializing extra tensors when numerics and shapes allow it. |
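The padding fix in the bank-conflict row can be checked with a little index arithmetic. This sketch models the bank mapping in plain Python rather than running a kernel; it assumes 32 banks of 4-byte words and a tile of 4-byte elements, which is a common configuration but still an assumption here.

```python
NUM_BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte word each


def banks_for_column_read(row_stride, col=0, n_threads=32):
    """Bank touched by each thread when thread t reads tile[t][col]."""
    return [(t * row_stride + col) % NUM_BANKS for t in range(n_threads)]


def max_conflict_degree(banks):
    """Largest number of threads that land on the same bank."""
    return max(banks.count(b) for b in set(banks))


unpadded = max_conflict_degree(banks_for_column_read(row_stride=32))  # tile[32][32]
padded = max_conflict_degree(banks_for_column_read(row_stride=33))    # tile[32][33]
print(unpadded, padded)
```

With a row stride of 32 every thread in the warp maps to the same bank, while a stride of 33 spreads the same column read across all 32 banks.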

FlashAttention mental model

Naive attention materializes the (B,H,S,S) score matrix. FlashAttention-style kernels compute attention in blocks with online softmax so the full score matrix does not need to be written to HBM.
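The online-softmax idea can be illustrated for a single query row in plain Python. This is a teaching sketch, not the FlashAttention kernel: it omits the 1/sqrt(Dh) scaling, masking, and all GPU tiling, and the function names are made up for this example. It only shows that processing keys block by block with a running max and running denominator reproduces the naive result without holding every score at once.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materialize every score, then softmax and weight the values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, block=2):
    """Process keys in blocks, rescaling running sums when the max changes."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.7], [2.0, 0.1], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
reference = naive_attention(q, keys, values)
blocked = blockwise_attention(q, keys, values, block=2)
```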

| Optimization point | Problem solved | Mechanism | What to measure | Common misunderstanding |
| --- | --- | --- | --- | --- |
| Coalescing | Uncoalesced global reads waste memory transactions. | Make adjacent threads access adjacent addresses. | global load/store efficiency, memory throughput | Coalescing is about address pattern, not only total bytes. |
| Shared memory tiling | Repeated HBM reads dominate arithmetic. | Load tiles once, reuse them inside the block. | HBM throughput, L2 hit rate, occupancy | Shared memory can also lower occupancy if tile size is too large. |
| Padding transpose | Shared memory bank conflicts serialize access. | Pad the layout, for example 32x33, to shift bank mapping. | shared bank conflict metric | Padding fixes one access pattern, not every layout problem. |
| Kernel fusion | Intermediate tensors create extra reads, writes, and launch overhead. | Combine adjacent operations into one kernel when shapes and numerics allow it. | kernel count, HBM writes, launch overhead | Fusion can increase register pressure or reduce reuse if applied blindly. |
| FlashAttention | S x S attention memory and HBM traffic | blockwise Q/K/V + online softmax | memory peak, attention kernel time, tokens/s | It is an attention algorithm/kernel strategy, not a new model architecture. |

How do you tell whether a kernel is bandwidth-bound?

Compare achieved memory bandwidth, Tensor Core utilization, occupancy, and instruction mix. If memory throughput is high while compute utilization is low, the next question is layout, tiling, fusion, or reducing bytes moved.

Distributed Training

Replicas, shards, collectives, and optimizer semantics

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Data parallel baseline

DDP keeps a full model replica on each rank, feeds each rank a different data shard, and all-reduces gradient buckets during backward so all replicas apply the same update.

DDP code path to recognize

The user code looks local, but DDP installs autograd hooks. When gradients are produced, those hooks bucket the gradients and launch collectives through the process group.

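import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# build_model, train_dataset, move_to_cuda, batch_size, and num_epochs are user-defined.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = DDP(build_model().cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder optimizer
sampler = DistributedSampler(train_dataset, shuffle=True)   # one data shard per rank
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
    for batch in loader:
        loss = model(**move_to_cuda(batch)).loss
        loss.backward()       # autograd hooks trigger gradient bucket all-reduce
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()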

Sharding training state

ZeRO and FSDP reduce per-GPU memory by sharding optimizer states, gradients, and parameters at different levels. That saves memory but adds all-gather, reduce-scatter, checkpoint, and runtime peak-memory considerations.

| Method | What is partitioned | Key communication | What it saves | What it costs | What to measure |
| --- | --- | --- | --- | --- | --- |
| DDP | Input data only; parameters are replicated. | gradient all-reduce | Throughput via more workers | Full model and optimizer memory per GPU | step time, all-reduce overlap, samples/s |
| ZeRO-1 | optimizer states | gradient reduction, then all-gather of updated parameters | Adam states memory | More state movement and optimizer complexity | per-GPU optimizer memory, step overhead |
| ZeRO-2 | optimizer states + gradients | gradient reduce-scatter, parameter all-gather | gradients and Adam states memory | Reduce-scatter scheduling on the backward path; total volume stays close to all-reduce | gradient memory, NCCL time |
| ZeRO-3 / FSDP | parameters + gradients + optimizer states | parameter all-gather, gradient reduce-scatter | Most persistent training state memory | Parameter materialization peaks and shard checkpoint complexity | peak memory, gather time, shard checkpoint cost |
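The table can be turned into a rough per-GPU estimate. The sketch below treats stage 0 as plain DDP and assumes mixed-precision Adam byte counts (2-byte weights, 2-byte gradients, 12 bytes of optimizer state per parameter); those numbers and the function name are assumptions, and runtime peaks from all-gather or prefetch are not modeled.

```python
def per_gpu_state_bytes(n_params, n_ranks, stage,
                        weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Persistent training state per GPU for stage 0 (DDP) through 3 (ZeRO-3/FSDP)."""
    weights = n_params * weight_bytes
    grads = n_params * grad_bytes
    optim = n_params * optim_bytes
    if stage >= 1:
        optim /= n_ranks    # shard optimizer states
    if stage >= 2:
        grads /= n_ranks    # also shard gradients
    if stage >= 3:
        weights /= n_ranks  # also shard parameters
    return weights + grads + optim


for stage in range(4):
    gb = per_gpu_state_bytes(7_000_000_000, n_ranks=8, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```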

Parallelism families

| Parallelism | What it splits or routes | Primary systems cost |
| --- | --- | --- |
| Tensor Parallel (TP) | Layer-internal matrix work. | High-frequency collectives; keep within a fast interconnect domain. |
| Pipeline Parallel (PP) | Model layers into stages. | Pipeline bubbles and stage balancing. |
| Data Parallel (DP) | Input batches while replicating or sharding model state. | Gradient or state synchronization. |
| Sequence / Context Parallelism | Sequence or context work. | Attention communication while reducing activation memory. |
| Expert Parallelism | Tokens routed across expert owners. | All-to-all traffic and load balancing. |
| 3D Parallelism | Multiple parallel axes together. | Topology-aware mapping of every collective. |

64 GPU topology design example

Keep high-frequency collectives such as tensor-parallel all-reduce inside NVLink when possible, map slower data-parallel communication across nodes, and explain how the placement changes failure domains and checkpoint traffic.
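One possible placement can be sketched by enumerating rank groups. The layout below is an assumption for illustration: 8 nodes with 8 GPUs each, TP=8, DP=8, no pipeline axis, and global ranks numbered contiguously within each node. The helper names are hypothetical.

```python
def build_groups(world_size=64, gpus_per_node=8, tp=8):
    """Return (tp_groups, dp_groups) as lists of global ranks."""
    assert world_size % tp == 0
    assert gpus_per_node % tp == 0, "TP groups must not straddle nodes"
    dp = world_size // tp
    # Consecutive ranks share a TP group, so each group stays inside one node.
    tp_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(dp)]
    # Ranks with the same TP position across groups form a DP group.
    dp_groups = [list(range(j, world_size, tp)) for j in range(tp)]
    return tp_groups, dp_groups


def nodes_spanned(group, gpus_per_node=8):
    """Number of distinct nodes a rank group touches."""
    return len({rank // gpus_per_node for rank in group})


tp_groups, dp_groups = build_groups()
print("TP group 0:", tp_groups[0], "nodes:", nodes_spanned(tp_groups[0]))
print("DP group 0:", dp_groups[0], "nodes:", nodes_spanned(dp_groups[0]))
```

In this layout every tensor-parallel collective stays inside one node, while each data-parallel group crosses all 8 nodes, so losing a node removes one full TP group and one member of every DP group.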

Common misunderstanding

Adding GPUs is not automatically a data-parallel speedup. The real design choice is which state is replicated, which state is sharded, and which collective or all-to-all exchange lands on the critical path.

Communication and Topology

Collectives, topology, and bottleneck diagnosis

Collective input/output intuition

AllReduce
Every rank contributes a tensor and every rank receives the reduced result. DDP gradient sync is the common example.
AllGather
Each rank starts with one shard and receives all shards, often used before a sharded parameter is needed.
ReduceScatter
Ranks reduce inputs and scatter the reduced shards, commonly paired with all-gather in sharded training.
AllToAll
Each rank sends a different slice to every other rank. Expert parallelism and token routing often pay this cost.
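The input/output contracts above can be simulated with plain lists, one inner list per rank. This is a single-process sketch of the semantics only; it does not use NCCL or `torch.distributed`, and it assumes sum as the reduction and a tensor length divisible by the number of ranks.

```python
def all_reduce(inputs):
    """Every rank receives the elementwise sum of all inputs."""
    total = [sum(col) for col in zip(*inputs)]
    return [list(total) for _ in inputs]


def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(inputs):
    """Rank r receives only the r-th chunk of the elementwise sum."""
    n = len(inputs)
    total = [sum(col) for col in zip(*inputs)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_to_all(inputs):
    """inputs[src][dst] is the slice rank src sends to rank dst."""
    n = len(inputs)
    return [[inputs[src][dst] for src in range(n)] for dst in range(n)]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
print(all_reduce(grads))
print(reduce_scatter(grads))
print(all_to_all([["a0", "a1"], ["b0", "b1"]]))
```

Chaining `reduce_scatter` and then `all_gather` reproduces `all_reduce`, which is why sharded training can swap one collective for the pair.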

Ring AllReduce per-GPU traffic: 2 * (N - 1) / N * data_size

Bandwidth vs latency: large buckets are usually bandwidth-bound, while many small collectives can be dominated by launch, synchronization, and topology latency.

How to read nvidia-smi topo -m

Topology labels tell you whether traffic stays on fast GPU links or crosses PCIe, host bridges, NUMA boundaries, or the system interconnect. Use them to place ranks, not just to describe hardware.

| Label | Intuition | Placement implication |
| --- | --- | --- |
| NV# | GPUs are connected by a bonded set of # NVLink links. | Keep tensor-parallel or high-frequency collectives here when possible. |
| PIX | Traffic crosses at most a single PCIe bridge or switch. | Acceptable for moderate traffic, but weaker than NVLink. |
| PHB | Traffic crosses a PCIe host bridge, typically the CPU. | Avoid placing the hottest collectives across this path. |
| NODE | Traffic crosses PCIe host bridges within a single NUMA node. | Bind CPU, NIC, and GPU placement deliberately. |
| SYS | Traffic crosses the inter-socket interconnect between NUMA nodes. | Treat it as a slow path for frequent GPU-GPU communication. |

Diagnosing the bottleneck

First decide whether the wait is compute, memory, communication, or orchestration. Then tie the diagnosis to a trace, counter, or controlled experiment.

| Bottleneck | Signal | Typical cause | Next experiment |
| --- | --- | --- | --- |
| Compute-bound | Tensor Core utilization is high. | dtype, tile shape, kernel library, or batch shape | Change dtype, kernel backend, or problem shape. |
| Memory-bound | HBM throughput high, compute utilization low. | Excessive reads/writes or poor locality | Try fusion, tiling, FlashAttention, quantization, or layout changes. |
| Communication-bound | NCCL ranges dominate the step timeline. | Large buckets, slow topology path, or poor overlap | Change rank placement, bucket size, overlap settings, or parallelism axes. |

Profiling Methodology

Trace first, then explain the bottleneck

A good performance explanation names the slow path, shows the evidence, and proposes one controlled experiment.

Tool split

| Tool | Question it answers | Evidence scope |
| --- | --- | --- |
| Nsight Systems | Where does elapsed time go? | CPU work, CUDA kernels, memory copies, and NCCL on a timeline. |
| Nsight Compute | Why does one kernel behave this way? | Occupancy, instruction mix, memory throughput, and bank conflicts. |
| PyTorch Profiler | Which framework operation created the work? | Operator-to-kernel attribution and high-level traces. |

Roofline mental model

Roofline reasoning asks whether arithmetic intensity is high enough to use available compute. If it is not, optimizing memory movement may matter more than adding more math throughput.
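A back-of-the-envelope roofline check can be sketched for a GEMM. The peak numbers below are hypothetical and do not describe any particular GPU, and the byte count assumes ideal reuse where each matrix is read or written exactly once; real kernels move more bytes than this.

```python
def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    """FLOPs per byte for C[M,N] = A[M,K] @ B[K,N] under ideal reuse."""
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bytes_per_s):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * peak_bytes_per_s)


PEAK_FLOPS = 300e12  # hypothetical compute roof, FLOP/s
PEAK_BW = 2e12       # hypothetical HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # intensity needed to reach the compute roof

decode_like = gemm_arithmetic_intensity(M=1, N=4096, K=4096)     # one token per step
prefill_like = gemm_arithmetic_intensity(M=4096, N=4096, K=4096)  # many tokens at once
for name, ai in [("decode-like", decode_like), ("prefill-like", prefill_like)]:
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{name}: {ai:.1f} FLOP/B, {bound}, "
          f"{attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12:.1f} TFLOP/s attainable")
```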

  1. Keep workload fixed: Do not compare traces with different batch sizes, prompt lengths, output lengths, or data loaders.
  2. First look at Systems: Check CPU gaps, CUDA launch overhead, NCCL waits, memory copies, and queue time.
  3. Then look at Compute: Use Nsight Compute for occupancy, memory throughput, instruction mix, and bank conflicts.
  4. Change one variable: Toggle one backend, dtype, bucket size, or batch shape so the evidence is interpretable.
  5. Report residual risk: Name what the trace does not prove, such as production traffic mix or version-sensitive backend behavior.

| Symptom | Likely cause | Tool evidence | Next move |
| --- | --- | --- | --- |
| GPU gaps between kernels | dataloader, tokenizer, CPU sync, H2D, or launch overhead | Nsight Systems CPU/GPU timeline | Fix input pipeline or remove synchronization. |
| NCCL dominates step time | communication-bound parallel layout | nsys trace + topology mapping | Revisit rank placement, overlap, and bucket size. |
| Memory throughput near peak | HBM-bound kernel or extra tensor materialization | Nsight Compute memory metrics | Try tiling, fusion, FlashAttention, quantization, or layout changes. |
| Shared memory conflict high | bank conflict or poor tile layout | NCU shared memory bank conflict | Pad or reshape shared-memory tiles. |
| Serving P99 high | queueing, long prompts, decode pressure, or KV Cache fragmentation | request traces + scheduler metrics | Separate prefill and decode metrics before tuning batching. |

torchrun --nproc-per-node=8 train.py --config config.yaml
nsys profile --trace=cuda,nvtx,osrt --output=step_report python train.py
ncu --set full --target-processes all python kernel_bench.py
# Serving comparison: keep prompt/output length distributions fixed before comparing engines.

Inference Serving

Prefill, decode, KV Cache, batching, and tail latency

Serving work is not one uniform forward pass. Separate prompt ingestion from token-by-token generation before reasoning about throughput, latency, or cache capacity.

Serving loop mental model

A request first enters a scheduler or queue, then prefill computes the prompt and writes KV Cache. Decode repeatedly reads that cache, appends one token per step, and streams output while the scheduler admits more work.

Prefill vs Decode

| Phase | State transition | Dominant metric | Typical pressure |
| --- | --- | --- | --- |
| Prefill | Processes prompt tokens and builds the initial KV Cache. | TTFT | Compute and attention work for long prompts. |
| Decode | Reads and appends KV Cache one generated token at a time. | TPOT | Cache capacity, memory bandwidth, and scheduling under load. |

KV Cache memory scales with 2 (K and V) * layers * batch * seq_len * kv_heads * head_dim * bytes per element. That is why request length distribution and cache management are first-order serving concerns.

Formula: serving KV Cache memory = 2 * layers * batch * seq_len * kv_heads * head_dim * bytes
  • 2 counts key and value tensors.
  • batch and seq_len capture concurrent requests and cached context length.
  • kv_heads can be smaller than attention heads for MQA/GQA models.
  • bytes depends on FP16/BF16 or cache quantization support.
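The formula translates directly into a small function. The model dimensions in the example (32 layers, head_dim 128, 32 attention heads, 8 KV heads for the GQA case) are illustrative assumptions rather than the configuration of a specific model.

```python
def kv_cache_bytes(layers, batch, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """Serving KV Cache size: 2 (K and V) * layers * batch * seq_len * kv_heads * head_dim * bytes."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_elem


mha = kv_cache_bytes(layers=32, batch=1, seq_len=4096, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(layers=32, batch=1, seq_len=4096, kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**30:.2f} GiB per sequence, GQA: {gqa / 2**30:.2f} GiB per sequence")
```

Because the result is linear in both batch and seq_len, the number of concurrent sequences a GPU can hold depends on the request length distribution, not only on the model.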
request arrives
  -> scheduler / queue
  -> prefill builds KV Cache and determines TTFT
  -> decode loop reads KV Cache one step at a time
  -> stream output tokens and track TPOT, P95/P99

PagedAttention, continuous batching, and prefix cache

PagedAttention-style cache management stores KV Cache in fixed-size blocks, which reduces waste from variable-length requests. Prefix caching reuses cached KV blocks across requests that share a prompt prefix. Continuous batching improves GPU utilization by mixing requests at different decode positions, but it can increase queueing complexity and tail-latency risk.
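The waste reduction can be illustrated by counting reserved token slots under two allocation policies. This is a toy accounting sketch, not an engine's allocator: the baseline that reserves `max_len` slots per request, the block size of 16, and the request lengths are all assumptions chosen for the example.

```python
import math


def contiguous_slots(lengths, max_len):
    """Baseline: reserve max_len token slots for every request up front."""
    return len(lengths) * max_len


def paged_slots(lengths, block_size=16):
    """Paged: reserve whole fixed-size blocks, only as many as each request needs."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


lengths = [37, 512, 90, 2048, 15]  # tokens currently cached per request
used = sum(lengths)
baseline = contiguous_slots(lengths, max_len=2048)
paged = paged_slots(lengths, block_size=16)
print(f"used {used}, contiguous reserves {baseline}, paged reserves {paged}")
```

In the paged case the only waste is the unfilled tail of each request's last block.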

Quantization and speculative decoding

Quantization can reduce memory and bandwidth pressure when kernels support the target dtype and accuracy remains acceptable. Speculative decoding helps only when acceptance rate, draft-model cost, and scheduler overhead make the extra machinery worthwhile.

| Metric | Meaning | What it tells you |
| --- | --- | --- |
| TTFT | Time To First Token | Prefill, queueing, and admission behavior. |
| TPOT | Time Per Output Token | Decode loop cost and cache-read pressure. |
| QPS | Requests served per second | Admission capacity under a specific traffic mix. |
| tokens/s | Generated tokens per second | Throughput, but only meaningful with prompt/output length distributions. |
| P95 / P99 | Tail latency percentiles | Queueing, long-context pressure, or scheduling unfairness. |

| System | Useful mental model | Check before claiming |
| --- | --- | --- |
| vLLM | PagedAttention, continuous batching, and serving scheduler behavior. | Version, backend, supported models, and workload shape. |
| SGLang | Structured generation, runtime scheduling, and serving orchestration. | Frontend language features, backend engine, and cache behavior. |
| TensorRT-LLM | Optimized inference runtime with engine build and kernel choices. | Hardware, dtype, quantization path, and engine build constraints. |

How should you explain high serving latency?

Separate queue time, prefill time, decode TPOT, and KV Cache pressure. Then say whether the evidence points to scheduling, memory bandwidth, cache fragmentation, or request-shape skew.
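That separation can be computed from per-request timestamps. The sketch below assumes a hypothetical trace with four timestamps per request (arrival, scheduled, first token, finished) and uses one common TPOT convention, decode time divided by output tokens after the first; engines may define these fields differently.

```python
import math


def request_metrics(arrival, scheduled, first_token, finished, n_output_tokens):
    """Split one request's latency into queue, prefill, TTFT, and TPOT (seconds)."""
    return {
        "queue": scheduled - arrival,
        "prefill": first_token - scheduled,
        "ttft": first_token - arrival,
        # Decode time spread over the tokens generated after the first one.
        "tpot": (finished - first_token) / max(n_output_tokens - 1, 1),
    }


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


m = request_metrics(arrival=0.0, scheduled=0.2, first_token=0.5,
                    finished=2.5, n_output_tokens=101)
print(m)

ttfts = [0.1 * i for i in range(1, 101)]  # synthetic TTFT samples
print("P95:", percentile(ttfts, 95), "P99:", percentile(ttfts, 99))
```

Reporting these fields per request, and then as percentiles, shows whether a high P99 comes from queueing, prefill, or decode rather than hiding all three in one average.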

Tradeoff Matrix

What each optimization buys and costs

Use this matrix to avoid generic claims. Each row should connect a concrete bottleneck to a mechanism, cost, and measurement.

The most common mistake is naming a technique without naming the state object or resource it changes.

| Technique | Problem | Mechanism | Saves what | Costs what | When to use | When not to use | What to measure |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ZeRO / FSDP | Training state does not fit per GPU. | Shard parameters, gradients, and/or optimizer states. | Persistent training memory | Extra collectives, materialization peaks, checkpoint complexity | Large training jobs where memory is the limit | Small models where communication dominates | peak memory, NCCL time, step time |
| Activation checkpointing | Activations dominate training memory. | Drop selected activations and recompute them in backward. | activation memory | extra compute and longer step time | Memory-bound training with spare compute | Compute-bound training already near the time budget | peak memory, step time, recompute overhead |
| FlashAttention | Attention S x S intermediates and HBM traffic are large. | blockwise Q/K/V + online softmax | HBM traffic, attention memory | backend constraints and kernel compatibility | Long context or attention-heavy workloads | Unsupported masks, dtypes, layouts, or hardware | attention time, memory peak, tokens/s |
| Tensor Parallel | One layer's matrix work is too large for one GPU. | Split layer-internal computation across ranks. | per-GPU compute and parameter load | high-frequency collectives | Fast intra-node interconnect such as NVLink | Slow cross-node paths for small layers | NCCL time per layer, MFU, step time |
| Pipeline Parallel | Model depth does not fit or scale well on one device group. | Split layers into pipeline stages. | per-stage model memory | bubble, stage imbalance, activation buffering | Very deep models with balanced stages | Small batch or uneven stage timing | bubble ratio, stage time, utilization |
| SP / CP | Sequence or context memory is too large. | Split sequence/context dimension across ranks. | activation/KV/context memory | all-to-all/gather, mask and kernel complexity | Long-context training or inference | Short sequences where overhead dominates | memory peak, collective time, correctness tests |
| Expert Parallel | MoE experts increase capacity but cannot all run on every rank. | Route tokens to experts placed on different ranks. | expert parameter and compute placement | all-to-all, load imbalance, capacity drops | MoE models with enough token volume | Highly skewed routing without capacity control | all-to-all time, expert load, dropped tokens |
| Quantization | Weights or KV Cache stress memory and bandwidth. | Store or compute with fewer bits when kernels support it. | memory, HBM bandwidth | accuracy risk, calibration, kernel support | Serving workloads limited by memory traffic | Quality-sensitive paths without validation | quality eval, tokens/s, TPOT, memory |
| Continuous batching | GPU idles while requests arrive and finish at different times. | Admit new requests into ongoing decode batches. | GPU idle time, goodput | scheduling complexity and queueing variance | Mixed online serving traffic | Strict isolation or very predictable single-request traffic | QPS, TPOT, P95/P99, queue time |
| Speculative decoding | Decode is slow one token at a time. | Draft tokens with a cheaper model and verify with the target model. | wall-clock decode time | draft-model cost, acceptance-rate sensitivity | High acceptance rate with cheap draft model | Low acceptance rate or tight quality constraints | acceptance rate, TPOT, quality |
| Triton custom kernel | Framework kernels leave a clear hot path. | Write a workload-specific fused or tiled kernel. | kernel launches, HBM traffic, overhead | maintenance and correctness burden | Stable hot workload with measurable headroom | Rapidly changing shapes or fragile numerics | kernel time, correctness, maintenance cost |

Whiteboard Drills

Practice saying the system tradeoff out loud

Use these prompts to turn mechanisms into interview-ready explanations.

How do you trace a Transformer block's tensor shapes?

Short answer: start from X(B,S,D), project Q/K/V, split heads to (B,H,S,D/H), form scores (B,H,S,S), then return to (B,S,D).

Deeper explanation: the shape trace tells you where attention memory grows, where FFN parameters live, and which axis a parallel strategy would split.

Pitfall: do not treat the S x S score matrix as a permanent model parameter; it is an activation/intermediate.

Follow-up: connect the shape trace to FlashAttention and KV Cache memory.

Why does FFN parameter count often look like mD^2?

Short answer: a two-layer FFN has roughly 2 * D * Dff parameters; when Dff is a multiple of D, the count is summarized as mD^2.

Deeper explanation: this estimate helps separate parameter memory from activation memory and KV Cache memory.

Pitfall: do not claim parameter count alone predicts training memory; optimizer states and activations can dominate.

Follow-up: explain how embedding size V*D changes the total.

How do you build a memory ledger for training?

Short answer: list weights, gradients, optimizer states, activations, temporary buffers, and communication buckets separately.

Deeper explanation: then mark which items are persistent, which appear only at runtime peaks, and which are replicated or sharded across ranks.

Pitfall: do not hide activation checkpointing, ZeRO/FSDP, and quantization under one generic "memory optimization" label.

Follow-up: estimate which owner dominates before choosing an optimization.

How do DDP and FSDP differ at the state level?

Short answer: DDP replicates parameters and synchronizes gradients; FSDP shards parameters, gradients, and optimizer state around module execution.

Deeper explanation: FSDP saves memory by keeping only a parameter shard resident on each rank, all-gathering full parameters just before a module runs, and reduce-scattering gradients during backward.

Pitfall: FSDP can still have runtime peak memory during all-gather or prefetch windows.

Follow-up: name the all-gather and reduce-scatter points in the step timeline.

How do you choose a parallelism layout for 64 GPUs?

Short answer: map the most frequent communication to the fastest links, usually keeping tensor parallelism inside a fast node or NVLink island.

Deeper explanation: choose TP, PP, DP/FSDP, context parallelism, or expert parallelism by naming the state and axis being split.

Pitfall: do not pick degrees before checking topology and bucket/activation traffic.

Follow-up: explain which collectives cross nodes and how that affects failure recovery.

How do you validate a systems claim?

Short answer: state the claim, hold the workload fixed, measure the relevant bottleneck, and change one variable.

Deeper explanation: a good experiment says whether the result is limited by memory bandwidth, compute, communication, scheduler behavior, or request mix.

Pitfall: do not compare tokens/s across different prompt/output distributions and call it a serving win.

Follow-up: show the counter or trace that would falsify your explanation.

How do prefill and decode stress serving differently?

Short answer: prefill handles the prompt and drives TTFT; decode repeatedly reads KV Cache and drives TPOT.

Deeper explanation: batching helps utilization, but longer prompts and retained context increase KV Cache pressure and tail latency.

Pitfall: do not explain serving latency with one average latency number.

Follow-up: separate queue time, prefill time, decode TPOT, and P95/P99.

How do you explain a memory-bound kernel?

Short answer: the kernel spends most of its time waiting on memory traffic while the math units sit underused, so improving layout, reuse, or fusion matters more than adding FLOPs.

Deeper explanation: point to HBM throughput, low compute utilization, memory transactions, or shared-memory conflicts.

Pitfall: do not call everything memory-bound just because memory is large; use profiler counters.

Follow-up: suggest one experiment: tiling, fusion, quantization, or FlashAttention depending on the workload.

Interview Practice

Use these representative prompts to rehearse mechanisms and tradeoffs. The full Q&A lives in the interview section so this handbook stays concept-first.

  • What does AI infrastructure optimize beyond model accuracy?
  • How would you debug a slow ML training job?
  • What signals belong in ML observability?
  • What makes LLM infrastructure different from ordinary model serving?

Runtime Extensions

New system branches

Continue from core training and serving mechanisms into specialized runtime domains.

RL Infrastructure

Rollout production, policy freshness, placement, distributed checkpointing, and recovery.

Multimodal Serving

Stage orchestration, streamed outputs, heterogeneous batching, and intermediate ownership.

Systems Runtime

RAII, bounded queues, backpressure, transports, NCCL errors, and timeout evidence.

Coding Practice

Executable exercises and calculators for memory, communication, latency, and rollout capacity.

Practical Labs

Annotated Code Reading Labs

  • Handbook explains the concepts.
  • Labs show the concepts through annotated code.
  • Starter files are optional source-of-truth examples.
  • Running the code: explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.

Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

References

Official references and supporting notes

Use official docs and papers for API behavior and version-sensitive claims; use blogs only to improve intuition.
