InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 03: DDP Conversion

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Mental Model

Core mechanism

  • In DDP specifically, the model and optimizer state are replicated on every rank, the data is sharded across ranks, and the gradient all-reduce during backward is the collective on the critical path. Applying the same averaged gradient on every rank keeps the optimizer updates consistent.
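
The replicated-versus-sharded split can be illustrated with a toy simulation. The sketch below uses plain Python rather than PyTorch, and every name in it (`grad_mse`, `all_reduce_mean`, the tiny dataset) is a hypothetical stand-in. It assumes equal-sized shards, in which case averaging the per-rank gradients reproduces the full-batch gradient.

```python
# Toy simulation, not PyTorch: a one-parameter model y = w * x with
# mean-squared-error loss. The weight is replicated; the data is sharded.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every rank gets the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

w = 0.5                      # replicated: identical on every rank
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
world_size = 2

# Sharded: rank r sees elements r, r + world_size, ...
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]

local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]   # differ per rank
synced_grads = all_reduce_mean(local_grads)                # identical per rank
full_batch_grad = grad_mse(w, xs, ys)

print(local_grads, synced_grads, full_batch_grad)
```

The local gradients differ because each rank saw different data, but after the averaging step every rank holds the same value, which is what lets identical optimizers stay in lockstep.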

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-03-ddp-conversion/train_ddp.py. This preview explains the key idea; the linked starter file is the source of truth.

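import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# model, loss, optimizer, and args are defined elsewhere in the starter file.
use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
rank = dist.get_rank()                       # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])   # device index on this node

if use_cuda:
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(), device_ids=[local_rank])
else:
    model = DDP(model)  # gloo on CPU: no device_ids

# DDP autograd hooks run during backward. Gradient buckets are all-reduced
# inside the process group so every rank applies the same averaged update.
loss.backward()
optimizer.step()

if rank == 0:
    torch.save({"model": model.module.state_dict()}, args.checkpoint)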

Line-by-line Explanation

Key code blocks

init_process_group
Creates the distributed communication context. DDP collectives use this context.
LOCAL_RANK
Local device index on the current node, usually used to bind one process to one GPU.
DDP(...)
Wraps the module and installs synchronization logic around gradient computation.
loss.backward()
Looks like a purely local call, but it is also where DDP all-reduces gradients across ranks.
rank == 0
Prevents every rank from writing the same checkpoint path.
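
The difference between the local rank and the global rank is easy to blur, so here is a small sketch of how the two relate. `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are the environment variables that torchrun sets; the helper names and the rank formula are assumptions for illustration, and the formula only holds for a layout where every node runs the same number of processes and ranks are numbered node by node.

```python
import os

def read_dist_env(env=None):
    """Read launcher-provided variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }

def expected_global_rank(node_rank, nproc_per_node, local_rank):
    """Assumed layout: ranks numbered node by node, same process count per node."""
    return node_rank * nproc_per_node + local_rank

# Hypothetical environment for the second process on the second of two
# 4-GPU nodes.
fake_env = {"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}
info = read_dist_env(fake_env)
print(info, expected_global_rank(node_rank=1, nproc_per_node=4, local_rank=1))
```

Under this layout every node has a process with local rank 0, but only one process in the whole job has global rank 0. That is why the checkpoint guard tests `rank` rather than `local_rank`.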

What to Notice

How to read this code

  • DDP does not shard model parameters; each rank has a full copy.
  • Gradient synchronization happens during backward, not in optimizer.step.
  • The global batch size is the per-rank batch size multiplied by the world size, so it grows with the number of ranks unless you shrink the per-rank batch to compensate.
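
The batch-size point is simple arithmetic, sketched below. Both helper names are hypothetical, and the sketch assumes every rank uses the same per-rank batch size with no gradient accumulation.

```python
def global_batch_size(per_rank_batch, world_size):
    """Samples contributing to one optimizer step across all ranks."""
    return per_rank_batch * world_size

def per_rank_batch_for(target_global_batch, world_size):
    """Per-rank batch needed to keep a fixed global batch size."""
    if target_global_batch % world_size != 0:
        raise ValueError("target global batch must divide evenly across ranks")
    return target_global_batch // world_size

# Keeping batch_size=32 in the DataLoader while moving from 1 to 8 GPUs
# silently turns a global batch of 32 into 256.
print(global_batch_size(32, 1), global_batch_size(32, 8))

# To preserve the original global batch of 32 on 8 GPUs, use 4 per rank.
print(per_rank_batch_for(32, 8))
```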

Common Misunderstandings

What this code does not mean

  • “DDP automatically makes the model fit in memory.” DDP replicates the model, so it does not reduce per-GPU model memory.
  • “AllReduce is a separate explicit call in user code.” DDP hides it behind autograd hooks.
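
A back-of-the-envelope calculation makes the memory point concrete. The function names are hypothetical, the sketch counts parameter storage only (gradients, optimizer state, and activations add more), and the evenly sharded variant is an idealized contrast, not something DDP does.

```python
def replicated_param_bytes_per_rank(num_params, bytes_per_param, world_size):
    """DDP-style replication: every rank stores all parameters."""
    return num_params * bytes_per_param  # world_size has no effect

def evenly_sharded_param_bytes_per_rank(num_params, bytes_per_param, world_size):
    """Idealized even sharding, for contrast only."""
    total = num_params * bytes_per_param
    return (total + world_size - 1) // world_size  # ceiling division

num_params = 1_000_000_000   # hypothetical 1B-parameter model
fp32 = 4                     # bytes per parameter

for world_size in (1, 8):
    print(
        world_size,
        replicated_param_bytes_per_rank(num_params, fp32, world_size),
        evenly_sharded_param_bytes_per_rank(num_params, fp32, world_size),
    )
```

Adding GPUs under replication leaves the per-rank figure unchanged, which is why a model that does not fit on one GPU will not fit under DDP either.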

Interview Explanation

How to say it out loud

A DDP job runs one process per GPU, usually started by a launcher such as torchrun. Each process joins a process group, builds the same model, and processes a different data shard. During backward, DDP all-reduces gradient buckets so every rank applies the same update. Rank 0 usually owns checkpoint writing.
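
"A different data shard" can be made concrete with a strided index split. The sketch below is plain Python with a hypothetical function name. It is similar in spirit to what `torch.utils.data.DistributedSampler` does without shuffling, but treat the padding detail as an assumption rather than a description of that class.

```python
import math

def shard_indices(num_samples, world_size, rank):
    """Strided split; pads by wrapping so every rank gets the same count."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]   # repeat early samples as padding
    return indices[rank:total:world_size]

for rank in range(4):
    print(rank, shard_indices(num_samples=10, world_size=4, rank=rank))
```

Equal shard lengths matter because every rank must run the same number of backward passes; a rank that finishes early would leave the others waiting in the all-reduce.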

External intuition notes

Additional intuition

  • The DDP tutorial frames the important hidden behavior: backward hooks trigger gradient synchronization across processes. Official: PyTorch DDP tutorial
  • DDP communication hook docs make bucket overlap concrete: gradients are bucketized so communication can overlap backward computation. Official: PyTorch DDP communication hooks
  • NCCL collectives docs are the vocabulary checkpoint: all-reduce, all-gather and reduce-scatter are communication patterns, not PyTorch-only ideas. Official: NCCL collectives
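
To pin down the collective vocabulary, here is a single-process simulation in plain Python. Each inner list stands for the buffer held by one rank; the function names mirror the collective names, but the code is a hypothetical model of their semantics with sum as the reduction, not NCCL or PyTorch code.

```python
def all_reduce(per_rank):
    """Every rank ends with the elementwise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r ends with only the r-th chunk of the elementwise sum."""
    world_size, n = len(per_rank), len(per_rank[0])
    if n % world_size != 0:
        raise ValueError("buffer length must divide evenly across ranks")
    chunk = n // world_size
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(per_rank_chunks):
    """Every rank ends with the concatenation of all ranks' chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two ranks, four gradient values each

print(all_reduce(grads))
print(reduce_scatter(grads))
print(all_gather(reduce_scatter(grads)))
```

In this model, a reduce-scatter followed by an all-gather yields the same result as an all-reduce, which is a useful way to remember how the three patterns relate.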

Further Reading

Official, paper and practical references