Foundation
2 representative questions.
Practice the system explanation inside the page.
This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.
Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.
Grouped by the kind of explanation the interview usually asks for.
2 representative questions.
6 representative questions.
2 representative questions.
2 representative questions.
4 representative questions.
Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.
DDP keeps a full model replica on every rank. Each rank computes gradients on a different data shard, then all-reduces gradient buckets so every rank applies the same averaged update locally.
Synchronizing gradients preserves the same optimization semantics as a larger global batch. Broadcasting parameters after every step would also synchronize replicas, but it gives up the overlap DDP gets by reducing buckets during backward. The important idea is that optimizer state and parameters remain aligned because every rank sees the same reduced gradients.
Saying DDP sends the whole model after every batch. In normal DDP, gradient buckets are synchronized during backward.
Source / Inspiration: PyTorch DDP docs · GPU quiz inspiration
Gradient bucketing groups parameter gradients into communication chunks. As soon as a bucket is ready during backward, DDP can start all-reduce and overlap communication with remaining computation.
Without bucketing, synchronization would wait until all gradients are produced, leaving less opportunity to hide communication. Bucket size is a tradeoff: small buckets can start earlier but add overhead; large buckets improve bandwidth efficiency but start later.
Assuming all gradients are reduced in one final operation after backward completes.
Source / Inspiration: PyTorch DDP docs · NCCL docs
ZeRO-1 shards optimizer states, ZeRO-2 also shards gradients, and ZeRO-3 shards parameters as well. Each stage reduces replicated memory at the cost of more communication and orchestration.
| Stage | State sharded | Tradeoff |
|---|---|---|
| ZeRO-1 | Optimizer states. | First persistent-memory reduction with limited added orchestration. |
| ZeRO-2 | Optimizer states and gradients. | More memory saved with additional backward communication. |
| ZeRO-3 | Optimizer states, gradients, and parameters. | Largest saving with parameter materialization and checkpoint complexity. |
This is useful because optimizer states can be several times larger than the parameters, especially with Adam-like optimizers. FSDP FULL_SHARD-style execution is conceptually close to ZeRO-3 because parameters, gradients and optimizer states can be sharded, but FSDP and DeepSpeed ZeRO are not identical APIs or runtimes.
Thinking ZeRO is a different training objective. It preserves data-parallel semantics while changing where states live.
Source / Inspiration: PyTorch FSDP docs · External curriculum
FSDP is PyTorch sharded data-parallel training inspired by ZeRO-3-style state partitioning. Instead of each rank keeping full parameters, gradients and optimizer state resident, ranks commonly materialize needed parameter shards for a unit of work and shard states again afterward.
In common FSDP-style execution, parameter shards are gathered when a module needs them and gradients are reduced/sharded during backward. Exact all-gather timing, prefetch behavior and overlap depend on PyTorch FSDP configuration and version.
Calling FSDP tensor parallel. Tensor parallel splits computation inside layers; FSDP shards state while preserving data-parallel replicas at a logical level.
Source / Inspiration: PyTorch FSDP docs
Use tensor parallelism when a layer's weights or matrix multiply are too large or slow for one GPU. It splits linear algebra across GPUs; a common heuristic is to map high-frequency TP collectives to the fastest available interconnect domain when possible.
Tensor parallelism adds collectives inside the model's forward and backward pass, so latency and bandwidth matter. It is common for very large Transformers where single-layer computation and memory exceed one device. It is less attractive across slow multi-node networks.
Using tensor parallelism as the first solution for every multi-GPU job. DDP is simpler when the model fits on one GPU.
Source / Inspiration: NCCL docs · External curriculum
Tensor parallel collectives happen inside layer execution, so they sit on the critical path. A common heuristic is to map them to the fastest available interconnect domain when possible.
Data parallel communication can often be bucketed and overlapped with backward computation. Tensor parallel communication is more tightly coupled with each layer's matrix operations. That is why TP groups are often mapped to GPUs with NVLink or similarly fast fabric, but the final placement depends on model shape, implementation and cluster topology.
Only counting GPUs and ignoring topology. The same world size can behave very differently depending on placement.
Source / Inspiration: NCCL docs · GPU quiz inspiration
Pipeline parallelism splits layers into stages across devices so a model that cannot fit on one GPU can still run. Microbatches keep stages busy instead of executing one full batch serially.
The main tradeoff is pipeline bubble and scheduling complexity. More stages reduce per-device memory but can increase idle time, activation transfer and debugging difficulty. It is often combined with data and tensor parallelism.
Thinking pipeline parallelism automatically gives linear speedup. Bubbles and imbalance can dominate.
Source / Inspiration: External curriculum · PyTorch distributed overview
They split the sequence dimension or long-context computation across devices. The goal is to reduce activation or attention memory when sequence length becomes the limiting factor.
Long context stresses attention and KV-like intermediate memory. Sequence/context parallelism introduces communication around operations that need full sequence information. The communication pattern is implementation-dependent: some systems use gather/scatter-style collectives, some use attention-specific communication, and some combine them with tensor parallelism.
Confusing it with data parallelism. Data parallel splits examples; sequence/context parallel splits work inside each example.
Source / Inspiration: External curriculum · Attention paper
Expert parallelism places different experts on different devices and routes tokens to selected expert shards. It reduces per-token compute while often introducing AllToAll-style exchange; the exact collective pattern depends on MoE implementation, routing strategy, TP layout and runtime.
A sparse MoE model activates only a subset of experts per token. The system challenge is balancing expert load and moving token representations efficiently. Bad routing can cause stragglers even when average compute looks reasonable.
Only describing MoE as more parameters. The production issue is routing, load balance and communication.
Source / Inspiration: External curriculum · NCCL docs
Use tensor parallel within the fastest available interconnect domain when possible, pipeline parallel across model stages, and data parallel across replicated groups. The combination is often called 3D parallelism.
The design starts from constraints: model memory, layer compute, sequence length, batch size and network topology. Each axis solves a different problem and introduces different communication. The right answer is topology-aware, not just maximum parallelism.
Stacking every parallel method by default. Each axis adds operational complexity and communication.
Source / Inspiration: External curriculum · NCCL docs
NCCL provides GPU communication collectives such as all-reduce, reduce-scatter, all-gather and broadcast. Frameworks use it to move tensors efficiently across GPUs and nodes.
NCCL is the communication layer, not the training algorithm. DDP, FSDP and tensor parallel systems call collectives with specific semantics. Topology, transport and tensor sizes affect performance.
Saying NCCL trains the model. It implements communication primitives used by training frameworks.
Source / Inspiration: NCCL user guide · GPU quiz inspiration
Look for low compute utilization during collective operations, step time increasing with world size, and profiler traces where all-reduce or all-gather dominates. Then inspect bucket sizes, overlap, topology and network counters.
A communication-bound job may scale well on one node and poorly across nodes. Remedies include changing parallelism strategy, improving overlap, reducing precision or gradient size, increasing batch per GPU, and placing ranks topology-aware.
Blaming the model architecture before checking profiler timelines and cluster placement.
Source / Inspiration: NVIDIA Nsight Systems · NCCL docs
DDP usually makes global batch equal to per-rank batch times number of ranks times accumulation steps. Changing it can change optimization behavior, not only throughput.
If you scale world size and keep per-rank batch fixed, the optimizer sees a larger effective batch. Learning rate, warmup and gradient accumulation choices may need adjustment. This is why throughput scaling and training quality have to be evaluated together.
Assuming more GPUs always means the same training run finishes faster with identical convergence.
Source / Inspiration: PyTorch DDP docs · External curriculum
Common failures include rank launch mismatch, network/firewall problems, NCCL timeout, one rank OOM, slow storage, dataloader imbalance and checkpoint corruption. A single failed rank can stall the whole job.
Distributed jobs fail as a group. Good systems expose rank-level logs, collectives timeout information, restart/checkpoint behavior and node placement. Debugging starts by identifying whether failure is deterministic, data-dependent or infrastructure-dependent.
Only reading rank 0 logs. The failing rank is often elsewhere.
Source / Inspiration: NCCL docs · Observability quiz inspiration
All-reduce gives every rank the full reduced result. Reduce-scatter gives each rank only a shard of the reduced result, and all-gather later reconstructs full tensors when needed.
This decomposition is important for sharded training. FSDP-style systems avoid keeping full gradients or parameters resident all the time. They communicate shards at the moments required by computation.
Treating all collectives as interchangeable. Their output placement determines memory behavior.
Source / Inspiration: NCCL collectives · PyTorch FSDP docs
Use DDP when the model and optimizer state fit comfortably and you want simplicity and strong throughput. Use FSDP when memory replication blocks training or batch size, accepting more communication and tuning complexity.
| Method | State placement | Choose when | Cost |
|---|---|---|---|
| DDP | Replicated model and optimizer state. | State fits per GPU and throughput scaling is the goal. | Full persistent state on each rank. |
| FSDP | Sharded parameters, gradients, and optimizer state. | Per-GPU state memory prevents training or useful batches. | Gather/scatter traffic and sharded checkpoint handling. |
The decision should consider model size, optimizer, activation memory, network speed and engineering maturity. FSDP can enable larger models, but poor wrapping, checkpointing or network placement can reduce throughput.
Choosing FSDP because it sounds more advanced. If DDP fits and scales, it is often easier to operate.
Source / Inspiration: PyTorch DDP docs · PyTorch FSDP docs
It needs distributed model and optimizer state plus enough runtime metadata to restore the training step and reject duplicated or stale inputs. In an online RL loop, policy versions and consumed-rollout cursors are part of that recovery contract.
Sharded checkpoints should support parallel save/load and load-time resharding when topology changes. Recovery correctness is larger than simply finding model weights on disk.
Saving only model parameters while losing optimizer state or online-data consumption state.
Source: PyTorch Distributed Checkpoint
Before an interview, you should be able to answer these without reading the page.
Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.