Concept first, code reading second
Read each mechanism as problem, synchronization pattern, state placement, communication cost, and common misunderstanding. Labs are optional annotated reading material.
A deep-dive branch of the AI Infra umbrella, focused on DDP, collectives, ZeRO/FSDP, tensor/pipeline/sequence/expert parallelism, and topology-aware design.
Read each mechanism as problem, synchronization pattern, state placement, communication cost, and common misunderstanding. Labs are optional annotated reading material.
Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Distributed training scales training scale beyond single-device memory and compute limits by partitioning inputs, model state, and computation across multiple accelerator ranks.
Single-GPU training runs into memory, compute-time, and input-throughput limits. Fitting large models requires dividing model states, activations, and batches across multiple nodes and chips.
Distributed training splits training work using data parallel, model parallel (tensor, pipeline, sequence), and expert parallel configurations. These strategies communicate via collective operations to synchronize weights, activations, and gradients across process groups.
Explain that distributed scaling is a multi-dimensional tradeoff between memory consumption, compute efficiency, and network communication bandwidth/latency. We choose the parallel strategy based on which resource is the dominant bottleneck.
Calling every multi-GPU method data parallel. In practice, data parallel only shards inputs while replicating the model. Model parallel strategies shard the model state itself, introducing different communication patterns and topologies.
Data Parallel (DDP) replicates the model across ranks while sharding input batches, aggregating gradients at each step using all-reduce collectives.
Training on a single data stream limits sample throughput. Data parallel replicates the entire model across multiple GPUs and feeds each replica a different data batch, allowing linear scaling of training batch size.
In PyTorch DDP, each GPU process processes its own batch in parallel. During the backward pass, DDP registers autograd hooks that trigger gradient synchronization (all-reduce) in background buckets, overlapping communication with backward pass computation.
rank is one participant in a distributed job, world_size is the number of ranks, and a process group is the communicator over which collectives such as all-reduce run.
rank 0 data shard rank 1 data shard rank 2 data shard
| | |
forward/backward forward/backward forward/backward
| | |
grad grad grad
+------ gradient all-reduce -----------+
|
same averaged gradients
|
local optimizer step
Describe gradient bucketing: DDP groups gradients in buckets (reverse order of execution) and initiates all-reduce as soon as a bucket is ready. This hides communication latency behind the remaining backpropagation math.
Assuming DDP scales indefinitely. As GPU count increases, the effective global batch size scales up, which can hurt model convergence unless learning rates are scaled. Also, at high scale, the all-reduce step can become communication-bound.
AllReduce synchronizes gradient state across ranks, ensuring weight updates remain identical without requiring a central coordinator.
Data parallel replicas must update their weights using identical, averaged gradients. AllReduce aggregates local gradients from all ranks and distributes the globally averaged result back to all ranks.
Standard collective implementations use ring or tree topologies. In a ring all-reduce, the tensor is sharded, and ranks pass chunks in a circle. Ranks execute N-1 steps of reduce-scatter (to sum values) and N-1 steps of all-gather (to broadcast the sums).
2 * (N - 1) / N * data
N: number of ranks.data: tensor size being reduced.2 comes from reduce-scatter plus all-gather phases in a ring-style all-reduce.State the bandwidth equation for ring all-reduce: 2 * (N - 1) / N * data. Highlight that the communication volume is independent of the number of nodes at high N, making it highly efficient for massive clusters.
Thinking all-reduce broadcasts all gradients to a central master node. Master-worker aggregation creates a severe network bottleneck at the master. Production setups use peer-to-peer ring or tree collectives over NCCL.
ZeRO and FSDP eliminate redundant memory by sharding parameters, gradients, and optimizer states across the data parallel group.
DDP replicates weights, gradients, and optimizer states on every GPU, which limits the maximum model size to what fits on a single device. ZeRO shards these states to free up memory.
ZeRO shards states: ZeRO-1 shards optimizer states, ZeRO-2 shards optimizer + gradients, and ZeRO-3 (FSDP) shards optimizer + gradients + parameters. FSDP gathers param shards dynamically during the forward pass and discards them after, reducing persistent weight memory.
replicated = params + gradients + optimizer statesZeRO-1 ~= params + gradients + optimizer states / DPZeRO-2 ~= params + gradients / DP + optimizer states / DPZeRO-3 ~= params / DP + gradients / DP + optimizer states / DP
DP: data-parallel degree.Replicated training:
GPU0: params + grads + optimizer
GPU1: params + grads + optimizer
ZeRO/FSDP-style sharding:
GPU0: state shard 0
GPU1: state shard 1
forward may all-gather parameters
backward may reduce-scatter gradients
FSDP-style execution commonly gathers parameter shards when a module needs them and reduces/shards gradients during backward. Exact all-gather timing, prefetch behavior and overlap depend on PyTorch FSDP configuration and version.
Detail the memory math of ZeRO stages. Discuss how FSDP exchanges memory footprint for communication overhead: it replaces a single DDP gradient all-reduce with parameter all-gathers in forward/backward and a gradient reduce-scatter.
Assuming ZeRO-3 has no peak memory issues. While persistent weights are sharded, a layer's parameters must be materialized in full on the GPU during its forward and backward computation, creating temporary memory peaks.
Tensor Parallel partitions matrix weights within a single layer across ranks, splitting wide matrix multiplications to distribute the computational load.
When a single layer's parameter size exceeds a GPU's capacity, we must split the matrix multiplications within that layer across multiple GPUs.
In Megatron-LM style TP, attention and MLP layers are split. Linear layers are partitioned columns-wise (first layer) and row-wise (second layer). This design ensures that columns-wise outputs feed row-wise layers, requiring only two communication steps per block.
Be ready to draw the column-row split diagram. Explain that column-parallel projection requires no immediate communication, but the subsequent row-parallel projection must run an all-reduce to sum the split outputs.
Thinking TP can scale across slow network links. Because TP communicates multiple times per layer, it is extremely sensitive to latency. It must be restricted to GPUs within the same NVLink domain (typically 8 GPUs per node).
Pipeline Parallel partitions model layers sequentially across nodes, using schedules like 1F1B to overlap stage computations and manage activation memory.
When a model has too many layers to fit on a single node's NVLink domain, we partition the layers sequentially across multiple nodes.
PP divides the model layers into sequential stages. To avoid GPipe's massive 'bubble' (idle GPUs waiting for activations), schedules like 1F1B (One Forward, One Backward) interleave forward and backward steps, keeping GPU activation memory stable.
| Schedule | Ordering | Activation and bubble implication | Use in an explanation |
|---|---|---|---|
| GPipe | Runs forward micro-batches before backward micro-batches. | Simple schedule, but stores more activations and exposes a larger bubble. | Use it to introduce pipeline fill and drain. |
| 1F1B | Interleaves one forward and one backward after warmup. | Bounds resident activations more tightly and improves steady-state utilization. | Use it when explaining practical PP memory control. |
Contrast GPipe and 1F1B. Explain that 1F1B limits activation memory to the number of micro-batches in the pipeline depth, whereas GPipe stores activations for the entire batch, leading to memory overhead.
Assuming PP is communication-bound. In fact, PP only communicates boundary activations and gradients between adjacent nodes. It has low communication volume compared to TP, making it ideal for cross-node partitioning.
Sequence and Context Parallel shard the token sequence dimension across ranks, mitigating activation memory peaks during long-context training.
Long-context training generates large activation tensors in attention layers, causing GPU out-of-memory (OOM) failures even when weights are sharded.
Sequence/Context Parallel partitions the sequence dimension across GPUs. Instead of duplicating the sequence inside a TP group, SP shards the activations and uses all-gather or ring-attention to compute attention scores across split sequences.
Explain that Ulysses-style SP uses all-to-all collectives to transpose the sequence dimension to head dimension before attention, while Ring Attention passes KV blocks around a ring of GPUs, matching the attention causal mask.
Thinking SP and TP are mutually exclusive. Ulysses-style sequence parallel is often combined with Tensor Parallelism to scale context length and model width concurrently within the same high-speed interconnect domain.
Expert Parallel shards Mixture-of-Experts layers across ranks, utilizing all-to-all collectives to route tokens to matching expert locations.
Mixture-of-Experts (MoE) scales model parameters without scaling compute costs by routing each token to a subset of specialized expert layers.
A gating network routes tokens to top-k experts. Ranks run all-to-all communication to send tokens to the GPUs holding the chosen experts, execute the expert MLPs in parallel, and run all-to-all to return the processed tokens.
Discuss MoE serving bottlenecks: load balancing is critical because if all tokens route to the same expert, other GPUs sit idle. Discuss capacity factor limits and load-balancing auxiliary loss functions used to enforce routing parity.
Assuming expert parallel is compute-heavy. Since each token only visits a few experts, the FLOPs per token remain low. However, the all-to-all communication creates massive network traffic, making it communication-bandwidth bound.
3D Parallelism integrates TP, PP, and DP/ZeRO axes to scale models to thousands of GPUs across diverse cluster topologies.
Training frontier LLMs with trillions of parameters requires combining multiple parallel axes to optimize memory and compute scaling across thousands of GPUs.
3D Parallelism integrates Data Parallel (DP/ZeRO), Tensor Parallel (TP), and Pipeline Parallel (PP). TP operates within a single GPU node (NVLink), PP partitions layers across nodes, and DP/ZeRO scales throughput across the entire cluster.
| Axis | What it partitions | Preferred placement | Main cost |
|---|---|---|---|
| TP | Matrix computation inside layers. | Fast intra-node GPU links such as NVLink. | Frequent latency-sensitive collectives. |
| PP | Sequential groups of layers. | Across balanced stages and acceptable inter-node links. | Pipeline bubbles and activation transfer. |
| DP / ZeRO | Samples or sharded training state across replicas. | Remaining replica groups across the cluster. | Gradient or state synchronization and checkpointing. |
Describe how to map the 3D grid to hardware topology: TP gets the fastest links (intra-node NVLink), PP gets intermediate connections (inter-node InfiniBand), and DP/ZeRO spans the remaining nodes to parallelize batch processing.
Thinking you can configure 3D parallel indices arbitrarily. A poor layout (e.g. putting TP across different nodes over slow Ethernet) will saturate network links instantly, bringing GPU computing efficiency down to single digits.
Topology-aware placement designs process groups to align communication frequency with hardware physical connection bandwidths.
Collectives must adapt to physical hardware layout. A mismatch between communication patterns and network topology causes severe congestion and slows down training.
Clusters have hierarchical connections: intra-node (NVLink, 900 GB/s), inter-node (InfiniBand/RoCE, 50 GB/s), and inter-rack. Collective libraries like NCCL dynamically build rings or trees to maximize utilization of available link bandwidth.
Discuss topology-aware rank placement. Ranks that communicate frequently (TP, Ulysses SP) must be placed on the same node. Ranks with intermediate traffic (PP, MoE EP) can span nodes, while DP covers the rest.
Assuming standard TCP/IP Ethernet is sufficient for large-scale distributed training. Ordinary TCP has high CPU overhead and latency. High-performance clusters use RDMA (Remote Direct Memory Access) over InfiniBand or RoCE to bypass CPU stacks.
Production interview scenarios evaluate failure diagnoses, performance modeling, and multi-node trace analyses.
Translating distributed concepts into concrete production diagnoses. Interviews test if you can pinpoint why a model OOMs or why scaling efficiency degrades.
Analyze system state using a memory ledger, trace collective traffic patterns, and use roofline models to identify whether a training job is compute-bound, HBM-bandwidth-bound, or network-communication-bound.
Always structure your answers by first naming the state (weights, grads, activations), then identifying the network topology/collective involved, and finally quantifying the performance tradeoff or metric win.
Focusing purely on algorithmic details. Interviewers want to see operational understanding: how to configure bucket sizes, how to diagnose link failures, and how to read Nsight profile traces.
Use these representative prompts to rehearse mechanisms and tradeoffs. The full Q&A lives in the interview section so this handbook stays concept-first.
Parallel layouts become operational systems when they synchronize online policies or serve requests under failure.
Apply sharded learner state and distributed checkpoints to online rollout/update loops.
Connect collectives and topology to timeouts, communicators, admission, and failure diagnosis.
Each lab includes a starter file, key snippets, line-by-line explanation, common misunderstandings, and interview framing.
| # | Lab | Page | Starter |
|---|---|---|---|
| 01 | DDP Gradient Buckets | Open lab | Starter folder |
| 02 | ZeRO / FSDP State Sharding | Open lab | Starter folder |
| 03 | Tensor Parallel Linear | Open lab | Starter folder |
| 04 | Pipeline Schedule Reading | Open lab | Starter folder |
| 05 | Sequence / Context Parallel | Open lab | Starter folder |
| 06 | Expert Parallel Routing | Open lab | Starter folder |
| 07 | 3D Parallelism Plan | Open lab | Starter folder |
| 08 | Topology Communication | Open lab | Starter folder |
| 09 | Interview Scenarios | Open lab | Starter folder |
Use official sources for factual checks and blogs only for supporting intuition.