Overview

Distributed Training Interview Practice

Practice explaining each system as you work through the page.

Part of InfraLens Interview Practice

Use questions to train explanation, not memorization

This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.

Reading method

Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.

Map

Question Map

Grouped by the kind of explanation the interview usually asks for.

Q&A

Q&A Cards

Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.

01. Why does DDP all-reduce gradients instead of parameters?

Short Answer

DDP keeps a full model replica on every rank. Each rank computes gradients on a different data shard, then all-reduces gradient buckets so every rank applies the same averaged update locally.

Deeper Explanation

Synchronizing gradients preserves the same optimization semantics as a larger global batch. Broadcasting parameters after every step would also synchronize replicas, but it gives up the overlap DDP gets by reducing buckets during backward. The important idea is that optimizer state and parameters remain aligned because every rank sees the same reduced gradients.
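The equivalence with a larger global batch can be checked with a small simulation. This is a minimal sketch in plain Python, not DDP itself: the scalar model, the data, and the helper names `grad_mse` and `all_reduce_mean` are assumptions made for illustration. It relies on equal shard sizes and a mean-reduced loss.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every rank receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# A global batch of 8 examples, split evenly across 4 simulated ranks.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5
world_size = 4
shard = len(xs) // world_size

# Each rank computes a gradient on its own shard only.
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]

# After the averaging all-reduce, every rank holds the same gradient ...
synced_grads = all_reduce_mean(local_grads)

# ... and it matches the gradient a single process would compute on all 8 examples.
global_grad = grad_mse(w, xs, ys)
```

Because every rank ends up with `synced_grads[r] == global_grad`, identical optimizers on each rank produce identical parameter updates without ever exchanging parameters.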

Common Mistake

Saying DDP sends the whole model after every batch. In normal DDP, gradient buckets are synchronized during backward.

Source / Inspiration: PyTorch DDP docs · GPU quiz inspiration

02. What is gradient bucketing in DDP?

Short Answer

Gradient bucketing groups parameter gradients into communication chunks. As soon as a bucket is ready during backward, DDP can start all-reduce and overlap communication with remaining computation.

Deeper Explanation

Without bucketing, synchronization would wait until all gradients are produced, leaving less opportunity to hide communication. Bucket size is a tradeoff: small buckets can start earlier but add overhead; large buckets improve bandwidth efficiency but start later.
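The size tradeoff is easier to see with a toy bucket builder. The sketch below is an assumption-laden illustration, not PyTorch's implementation: `build_buckets`, the layer names, and the byte sizes are hypothetical. It walks parameters in reverse because gradients for the last layers tend to be ready first during backward.

```python
def build_buckets(params, bucket_cap_bytes):
    """Group (name, grad_bytes) pairs into buckets, walking parameters in reverse.

    A bucket is closed as soon as adding the next gradient would exceed the cap.
    A single gradient larger than the cap still gets a bucket of its own.
    """
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in reversed(params):
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Hypothetical model: four layers with gradient sizes in bytes, in forward order.
params = [("embed", 40), ("block1", 20), ("block2", 20), ("head", 10)]

small_cap = build_buckets(params, bucket_cap_bytes=32)
large_cap = build_buckets(params, bucket_cap_bytes=1000)
```

With the small cap the first bucket (`head` and `block2`) can start its all-reduce while `block1` and `embed` are still computing gradients. With the large cap everything lands in one bucket, which cannot start until the whole backward pass is done. In PyTorch the corresponding knob is the `bucket_cap_mb` argument of `DistributedDataParallel`.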

Common Mistake

Assuming all gradients are reduced in one final operation after backward completes.

Source / Inspiration: PyTorch DDP docs · NCCL docs

03. What changes from ZeRO-1 to ZeRO-3?

Short Answer

ZeRO-1 shards optimizer states, ZeRO-2 also shards gradients, and ZeRO-3 shards parameters as well. Each stage reduces replicated memory at the cost of more communication and orchestration.

Stage | State sharded | Tradeoff
ZeRO-1 | Optimizer states. | First persistent-memory reduction with limited added orchestration.
ZeRO-2 | Optimizer states and gradients. | More memory saved with additional backward communication.
ZeRO-3 | Optimizer states, gradients, and parameters. | Largest saving with parameter materialization and checkpoint complexity.

Deeper Explanation

Staged sharding is useful because optimizer states can be several times larger than the parameters, especially with Adam-like optimizers. FSDP FULL_SHARD-style execution is conceptually close to ZeRO-3 because parameters, gradients and optimizer states can all be sharded, but FSDP and DeepSpeed ZeRO are not identical APIs or runtimes.
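A rough per-rank memory estimate makes the stages concrete. The sketch below assumes a commonly cited accounting for mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state (fp32 master weights, momentum, and variance). Those byte counts and the function name `zero_bytes_per_rank` are assumptions; activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_rank(num_params, world_size, stage,
                        param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate persistent model-state bytes per rank for ZeRO stage 0 to 3.

    Stage 0 means plain data parallelism with everything replicated.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = num_params * param_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    if stage >= 1:
        optim /= world_size   # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= world_size   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= world_size  # ZeRO-3: also shard parameters
    return params + grads + optim


# Hypothetical 1B-parameter model on 8 ranks, reported in GB per rank.
estimates_gb = {
    stage: zero_bytes_per_rank(1_000_000_000, 8, stage) / 1e9
    for stage in range(4)
}
```

Under these assumptions the estimate drops from 16 GB per rank with full replication to 5.5 GB, 3.75 GB, and 2 GB for stages 1 through 3. The largest single step comes from sharding optimizer state, which is why ZeRO-1 is often the first thing to try.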

Common Mistake

Thinking ZeRO is a different training objective. It preserves data-parallel semantics while changing where states live.

Source / Inspiration: PyTorch FSDP docs · External curriculum

04. How would you explain FSDP in one minute?

Short Answer

FSDP is PyTorch's sharded data-parallel training, inspired by ZeRO-3-style state partitioning. Instead of keeping full parameters, gradients and optimizer state resident, each rank holds a shard, commonly gathers the full parameters for one unit of work just before it runs, and frees them again afterward.

Deeper Explanation

In common FSDP-style execution, parameter shards are gathered when a module needs them and gradients are reduced/sharded during backward. Exact all-gather timing, prefetch behavior and overlap depend on PyTorch FSDP configuration and version.
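The gather-then-free pattern implies a simple back-of-the-envelope bound on resident parameters. This is a rough sketch under stated assumptions, not a model of any specific FSDP version: `fsdp_param_residency` and the unit sizes are hypothetical, only one wrapped unit is assumed to be gathered at a time, and prefetching would raise the peak.

```python
def fsdp_param_residency(unit_sizes, world_size):
    """Return (persistent, peak) parameter element counts per rank.

    persistent: each rank's shard of every unit, always resident.
    peak: the persistent shards plus the largest unit fully gathered.
    """
    persistent = sum(unit_sizes) / world_size
    peak = persistent + max(unit_sizes)
    return persistent, peak


# Hypothetical model wrapped into three units, sharded across 4 ranks.
coarse = fsdp_param_residency([100, 300, 200], world_size=4)

# The same 600 parameters wrapped as one single unit.
single_unit = fsdp_param_residency([600], world_size=4)
```

The comparison shows why wrapping granularity matters: wrapping the whole model as one unit gathers every parameter at once, so the peak is larger than the unsharded model even though the persistent shard is the same size.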

Common Mistake

Calling FSDP tensor parallel. Tensor parallel splits computation inside layers; FSDP shards state while preserving data-parallel replicas at a logical level.

Source / Inspiration: PyTorch FSDP docs

05. When would you choose tensor parallelism?

Short Answer

Use tensor parallelism when a layer's weights or matrix multiply are too large or slow for one GPU. It splits linear algebra across GPUs; a common heuristic is to map high-frequency TP collectives to the fastest available interconnect domain when possible.

Deeper Explanation

Tensor parallelism adds collectives inside the model's forward and backward pass, so latency and bandwidth matter. It is common for very large Transformers where single-layer computation and memory exceed one device. It is less attractive across slow multi-node networks.
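One common form of tensor parallelism splits a linear layer's weight by columns. The sketch below simulates that split with nested lists; the helper names and the tiny matrices are assumptions for illustration. Concatenating the partial outputs stands in for the all-gather a real implementation would issue.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split weight matrix w into `parts` equal blocks of columns."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


x = [[1, 2], [3, 4]]               # activations, replicated on every rank
w = [[1, 0, 2, 1], [0, 1, 1, 3]]   # full weight: 2 inputs, 4 outputs

# Each simulated rank multiplies the same input by its own column block.
partials = [matmul(x, w_shard) for w_shard in split_columns(w, parts=2)]

# Concatenating along columns plays the role of the all-gather collective.
y_parallel = [sum((p[i] for p in partials), []) for i in range(len(x))]
y_reference = matmul(x, w)
```

Each rank stores and multiplies only half of `w`, yet `y_parallel` equals `y_reference`. The price is the collective needed to assemble the output, which is paid on every forward and backward pass through the layer.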

Common Mistake

Using tensor parallelism as the first solution for every multi-GPU job. DDP is simpler when the model fits on one GPU.

Source / Inspiration: NCCL docs · External curriculum

06. Why should tensor parallel usually stay inside a fast interconnect domain?

Short Answer

Tensor parallel collectives happen inside layer execution, so they sit on the critical path. A common heuristic is to map them to the fastest available interconnect domain when possible.

Deeper Explanation

Data parallel communication can often be bucketed and overlapped with backward computation. Tensor parallel communication is more tightly coupled with each layer's matrix operations. That is why TP groups are often mapped to GPUs with NVLink or similarly fast fabric, but the final placement depends on model shape, implementation and cluster topology.

Common Mistake

Only counting GPUs and ignoring topology. The same world size can behave very differently depending on placement.

Source / Inspiration: NCCL docs · GPU quiz inspiration

07. What problem does pipeline parallelism solve?

Short Answer

Pipeline parallelism splits layers into stages across devices so a model that cannot fit on one GPU can still run. Microbatches keep stages busy instead of executing one full batch serially.

Deeper Explanation

The main tradeoff is pipeline bubble and scheduling complexity. More stages reduce per-device memory but can increase idle time, activation transfer and debugging difficulty. It is often combined with data and tensor parallelism.
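The bubble can be estimated for the simplest case. The sketch below assumes a GPipe-style schedule in which every stage takes the same time per microbatch and communication is free; the function name is illustrative. Under those assumptions the idle fraction is (p - 1) / (m + p - 1) for p stages and m microbatches.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple GPipe-style schedule with equal stage times."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one microbatch")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Four stages with an increasing number of microbatches.
fractions = {m: bubble_fraction(4, m) for m in (1, 4, 8, 32)}
```

With one microbatch, three of four stages are idle at any time. More microbatches shrink the bubble, and more stages grow it, which is why stage count and microbatch count have to be chosen together.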

Common Mistake

Thinking pipeline parallelism automatically gives linear speedup. Bubbles and imbalance can dominate.

Source / Inspiration: External curriculum · PyTorch distributed overview

08. What is sequence or context parallelism for?

Short Answer

They split the sequence dimension or long-context computation across devices. The goal is to reduce activation or attention memory when sequence length becomes the limiting factor.

Deeper Explanation

Long context stresses attention and KV-like intermediate memory. Sequence/context parallelism introduces communication around operations that need full sequence information. The communication pattern is implementation-dependent: some systems use gather/scatter-style collectives, some use attention-specific communication, and some combine them with tensor parallelism.

Common Mistake

Confusing it with data parallelism. Data parallel splits examples; sequence/context parallel splits work inside each example.

Source / Inspiration: External curriculum · Attention paper

09. What is expert parallelism in MoE models?

Short Answer

Expert parallelism places different experts on different devices and routes each token to the devices that hold its selected experts. Sparse activation keeps per-token compute low, while the routing often introduces AllToAll-style exchange; the exact collective pattern depends on MoE implementation, routing strategy, TP layout and runtime.

Deeper Explanation

A sparse MoE model activates only a subset of experts per token. The system challenge is balancing expert load and moving token representations efficiently. Bad routing can cause stragglers even when average compute looks reasonable.
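The straggler effect shows up as soon as tokens per expert are counted. This is a minimal sketch with made-up top-1 routing decisions; `expert_load` and `imbalance` are hypothetical helper names, and real routers add capacity limits and auxiliary losses that are not modeled here.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean


# Hypothetical top-1 routing decisions for 8 tokens over 4 experts.
balanced_routing = [0, 1, 2, 3, 0, 1, 2, 3]
skewed_routing = [0, 0, 0, 0, 0, 1, 2, 3]

balanced_load = expert_load(balanced_routing, num_experts=4)
skewed_load = expert_load(skewed_routing, num_experts=4)
```

Both routings process 8 tokens, so the average work per expert is identical. In the skewed case one expert handles 5 tokens and the step waits for it, giving an imbalance of 2.5 against 1.0 for the balanced case.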

Common Mistake

Only describing MoE as more parameters. The production issue is routing, load balance and communication.

Source / Inspiration: External curriculum · NCCL docs

10. How do you combine data, tensor and pipeline parallelism?

Short Answer

Use tensor parallel within the fastest available interconnect domain when possible, pipeline parallel across model stages, and data parallel across replicated groups. The combination is often called 3D parallelism.

Deeper Explanation

The design starts from constraints: model memory, layer compute, sequence length, batch size and network topology. Each axis solves a different problem and introduces different communication. The right answer is topology-aware, not just maximum parallelism.
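One way to make the layout topology-aware is to let the tensor-parallel index vary fastest over global ranks. The sketch below shows one possible ordering, not the convention of any particular framework; `rank_to_coords` and `tp_group` are hypothetical helpers, and it assumes the launcher assigns consecutive ranks to GPUs on the same node.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range for this layout")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group(rank, tp):
    """Global ranks that share a tensor-parallel group with `rank`."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))


# Hypothetical 16-GPU job: 2 nodes of 8 GPUs, TP=4, PP=2, DP=2.
layout = {rank: rank_to_coords(rank, tp=4, pp=2, dp=2) for rank in range(16)}
group_of_rank_9 = tp_group(9, tp=4)
```

Because TP groups are runs of consecutive ranks (here 8 to 11 for rank 9), a TP size that divides the GPUs per node keeps every TP collective inside one node under the stated launcher assumption. The less frequent pipeline and data-parallel traffic is what crosses nodes.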

Common Mistake

Stacking every parallel method by default. Each axis adds operational complexity and communication.

Source / Inspiration: External curriculum · NCCL docs

11. What does NCCL do in distributed training?

Short Answer

NCCL provides GPU communication collectives such as all-reduce, reduce-scatter, all-gather and broadcast. Frameworks use it to move tensors efficiently across GPUs and nodes.

Deeper Explanation

NCCL is the communication layer, not the training algorithm. DDP, FSDP and tensor parallel systems call collectives with specific semantics. Topology, transport and tensor sizes affect performance.

Common Mistake

Saying NCCL trains the model. It implements communication primitives used by training frameworks.

Source / Inspiration: NCCL user guide · GPU quiz inspiration

12. How do you diagnose communication-bound distributed training?

Short Answer

Look for low compute utilization during collective operations, step time increasing with world size, and profiler traces where all-reduce or all-gather dominates. Then inspect bucket sizes, overlap, topology and network counters.

Deeper Explanation

A communication-bound job may scale well on one node and poorly across nodes. Remedies include changing parallelism strategy, improving overlap, reducing precision or gradient size, increasing batch per GPU, and placing ranks with the topology in mind.
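Before opening a profiler, a bandwidth-only estimate can show whether communication is even plausible as the bottleneck. The sketch below assumes a ring all-reduce, in which each rank sends roughly 2(N - 1)/N times the payload; NCCL may choose other algorithms, latency is ignored, and all numbers are hypothetical.

```python
def ring_all_reduce_seconds(payload_bytes, world_size, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound for one ring all-reduce; ignores latency."""
    if world_size < 2:
        return 0.0
    traffic = 2 * (world_size - 1) / world_size * payload_bytes
    return traffic / bandwidth_bytes_per_s


def exposed_comm_fraction(comm_seconds, overlapped_seconds, step_seconds):
    """Share of the step spent on communication that compute did not hide."""
    exposed = max(comm_seconds - overlapped_seconds, 0.0)
    return exposed / step_seconds


# Hypothetical job: 1 GB of gradients, 8 ranks, 10 GB/s effective bandwidth.
comm = ring_all_reduce_seconds(1e9, world_size=8, bandwidth_bytes_per_s=1e10)

# Hypothetical step of 0.5 s in which 0.1 s of communication overlaps backward.
fraction = exposed_comm_fraction(comm, overlapped_seconds=0.1, step_seconds=0.5)
```

If the estimated exposed fraction is small but the measured step time is still dominated by collectives, the gap usually points at placement, stragglers, or poor overlap rather than raw bandwidth.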

Common Mistake

Blaming the model architecture before checking profiler timelines and cluster placement.

Source / Inspiration: NVIDIA Nsight Systems · NCCL docs

13. Why does global batch size matter in DDP?

Short Answer

Under DDP, the global batch is usually the per-rank batch times the number of ranks times the gradient-accumulation steps. Changing it can change optimization behavior, not only throughput.

Deeper Explanation

If you scale world size and keep per-rank batch fixed, the optimizer sees a larger effective batch. Learning rate, warmup and gradient accumulation choices may need adjustment. This is why throughput scaling and training quality have to be evaluated together.
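The arithmetic is simple enough to keep in a helper. The sketch below computes the global batch and applies the linear learning-rate scaling heuristic, which is one common starting point and not something the source prescribes; the function names and numbers are assumptions.

```python
def global_batch_size(per_rank_batch, world_size, accumulation_steps=1):
    """Examples contributing to one optimizer step across all ranks."""
    return per_rank_batch * world_size * accumulation_steps


def linearly_scaled_lr(base_lr, base_global_batch, new_global_batch):
    """Linear scaling heuristic: a starting point to validate, not a guarantee."""
    return base_lr * new_global_batch / base_global_batch


# Hypothetical run tuned on 8 GPUs, then moved to 32 GPUs.
before = global_batch_size(per_rank_batch=32, world_size=8)
after = global_batch_size(per_rank_batch=32, world_size=32)
suggested_lr = linearly_scaled_lr(0.1, before, after)

# Alternative: keep the global batch fixed by shrinking the per-rank batch.
same_batch = global_batch_size(per_rank_batch=8, world_size=32)
```

Going from 8 to 32 GPUs at a fixed per-rank batch quadruples the global batch from 256 to 1024. Either the learning rate and warmup are retuned, or the per-rank batch is reduced so the optimizer sees the same batch as before.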

Common Mistake

Assuming more GPUs always means the same training run finishes faster with identical convergence.

Source / Inspiration: PyTorch DDP docs · External curriculum

14. What failures are common in multi-node training?

Short Answer

Common failures include rank launch mismatch, network/firewall problems, NCCL timeout, one rank OOM, slow storage, dataloader imbalance and checkpoint corruption. A single failed rank can stall the whole job.

Deeper Explanation

Distributed jobs fail as a group. Good systems expose rank-level logs, collective timeout information, restart/checkpoint behavior and node placement. Debugging starts by identifying whether the failure is deterministic, data-dependent or infrastructure-dependent.

Common Mistake

Only reading rank 0 logs. The failing rank is often elsewhere.

Source / Inspiration: NCCL docs · Observability quiz inspiration

15. What is the difference between all-reduce and reduce-scatter plus all-gather?

Short Answer

All-reduce gives every rank the full reduced result. Reduce-scatter gives each rank only a shard of the reduced result, and all-gather later reconstructs full tensors when needed.

Deeper Explanation

This decomposition is important for sharded training. FSDP-style systems avoid keeping full gradients or parameters resident all the time. They communicate shards at the moments required by computation.
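The relationship between the three collectives can be simulated with plain lists, where the outer list index is the rank. This is a sketch of the semantics only, with sum as the reduction and equal shard sizes assumed; it says nothing about how NCCL implements them.

```python
def all_reduce(inputs):
    """Every rank receives the full elementwise sum."""
    total = [sum(vals) for vals in zip(*inputs)]
    return [list(total) for _ in inputs]


def reduce_scatter(inputs):
    """Rank r receives only the r-th equal shard of the elementwise sum."""
    total = [sum(vals) for vals in zip(*inputs)]
    shard = len(total) // len(inputs)
    return [total[r * shard:(r + 1) * shard] for r in range(len(inputs))]


def all_gather(shards):
    """Every rank receives the concatenation of all ranks' shards."""
    full = [v for s in shards for v in s]
    return [list(full) for _ in shards]


# Two simulated ranks, each holding a 4-element gradient.
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]

reduced_shards = reduce_scatter(grads)   # each rank keeps half of the sum
rebuilt = all_gather(reduced_shards)     # full tensors only when needed
```

Here `rebuilt` equals `all_reduce(grads)`, so the two-step form computes the same result. The difference is that between the two steps each rank holds only its own shard, which is the window sharded training uses to save memory.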

Common Mistake

Treating all collectives as interchangeable. Their output placement determines memory behavior.

Source / Inspiration: NCCL collectives · PyTorch FSDP docs

16. How would you choose between DDP and FSDP?

Short Answer

Use DDP when the model and optimizer state fit comfortably and you want simplicity and strong throughput. Use FSDP when memory replication blocks training or batch size, accepting more communication and tuning complexity.

Method | State placement | Choose when | Cost
DDP | Replicated model and optimizer state. | State fits per GPU and throughput scaling is the goal. | Full persistent state on each rank.
FSDP | Sharded parameters, gradients, and optimizer state. | Per-GPU state memory prevents training or useful batches. | Gather/scatter traffic and sharded checkpoint handling.

Deeper Explanation

The decision should consider model size, optimizer, activation memory, network speed and engineering maturity. FSDP can enable larger models, but poor wrapping, checkpointing or network placement can reduce throughput.

Common Mistake

Choosing FSDP because it sounds more advanced. If DDP fits and scales, it is often easier to operate.

Source / Inspiration: PyTorch DDP docs · PyTorch FSDP docs

17. What does a recoverable sharded learner checkpoint require?

Short Answer

It needs distributed model and optimizer state plus enough runtime metadata to restore the training step and reject duplicated or stale inputs. In an online RL loop, policy versions and consumed-rollout cursors are part of that recovery contract.

Deeper Explanation

Sharded checkpoints should support parallel save/load and load-time resharding when topology changes. Recovery correctness is larger than simply finding model weights on disk.
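The runtime metadata in that recovery contract can be sketched as a small record saved next to the sharded state. Everything below is hypothetical: the class `LearnerCheckpointMeta`, its fields, the `should_consume` check, and the staleness rule are illustrative assumptions, not part of PyTorch Distributed Checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerCheckpointMeta:
    """Hypothetical metadata stored alongside sharded model and optimizer state."""
    global_step: int
    policy_version: int
    consumed_rollout_cursor: int  # highest rollout sequence number already trained on
    world_size: int               # topology at save time, needed when resharding


def should_consume(meta, rollout_seq, rollout_policy_version, max_staleness=1):
    """Accept a rollout only if it is new and not too stale for the restored policy."""
    if rollout_seq <= meta.consumed_rollout_cursor:
        return False  # duplicate: already consumed before the checkpoint
    if meta.policy_version - rollout_policy_version > max_staleness:
        return False  # stale: generated by a policy that is too old
    return True


# State restored after a crash at step 100.
meta = LearnerCheckpointMeta(
    global_step=100, policy_version=7, consumed_rollout_cursor=500, world_size=8
)

decisions = [
    should_consume(meta, rollout_seq=500, rollout_policy_version=7),  # duplicate
    should_consume(meta, rollout_seq=501, rollout_policy_version=6),  # fresh enough
    should_consume(meta, rollout_seq=502, rollout_policy_version=5),  # too stale
]
```

Without the cursor and policy version, a restarted learner could train twice on rollout 500 or accept data from a policy two versions behind, even though the model weights themselves loaded correctly.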

Common Mistake

Saving only model parameters while losing optimizer state or online-data consumption state.

Source: PyTorch Distributed Checkpoint

Review

Final Review Checklist

Before an interview, you should be able to answer these without reading the page.

  • Why does DDP all-reduce gradients instead of parameters?
  • What is gradient bucketing in DDP?
  • What changes from ZeRO-1 to ZeRO-3?
  • How would you explain FSDP in one minute?
  • When would you choose tensor parallelism?
  • Why should tensor parallel usually stay inside a fast interconnect domain?
  • What problem does pipeline parallelism solve?
  • What is sequence or context parallelism for?

Sources

Sources and Further Reading

Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.
