Lab 03 - DDP Conversion

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Distributed Training / Communication

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Training Communication

Concept Goal

Read code to understand the concept

The goal is to read train_ddp.py and identify, for DDP specifically, what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Mental Model

Core mechanism

In DDP, every rank holds a full replica of the model and processes a different shard of the data. The collective on the critical path is the gradient all-reduce that runs during backward, and optimizer semantics stay consistent because every rank applies the same averaged gradients.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README
train_ddp.py
Optional note template

Annotated Code Preview

Starter Preview

Excerpt from code/lab-03-ddp-conversion/train_ddp.py. This preview explains the key idea; the linked starter file is the source of truth.

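import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Excerpt: model, loss, optimizer, and args are assumed to be defined
# elsewhere in the starter file.
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
rank = dist.get_rank()  # global rank, used by the checkpoint guard below
local_rank = int(os.environ["LOCAL_RANK"])
# The CUDA calls below assume a GPU is present; the gloo fallback alone
# does not make this excerpt CPU-safe.
torch.cuda.set_device(local_rank)

model = DDP(model.cuda(), device_ids=[local_rank])

# DDP autograd hooks run during backward. Gradient buckets are all-reduced
# inside the process group so every rank applies the same averaged update.
loss.backward()
optimizer.step()

if rank == 0:
    torch.save({"model": model.module.state_dict()}, args.checkpoint)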

Line-by-line Explanation

Key code blocks

init_process_group: Creates the distributed communication context. DDP collectives use this context.
LOCAL_RANK: Local device index on the current node, usually used to bind one process to one GPU.
DDP(...): Wraps the module and registers autograd hooks that synchronize gradients during backward.
loss.backward(): Looks like a purely local call, but it is also where DDP all-reduces the gradient buckets.
rank == 0: Uses the global rank so that only one process writes the checkpoint, instead of every rank writing to the same path.
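
The difference between the global rank and LOCAL_RANK can be sketched with plain arithmetic. This is only an illustration: it assumes a homogeneous launch in which every node runs the same number of processes and ranks are assigned contiguously per node, and the helper name `global_rank` is hypothetical rather than a PyTorch API.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank, assuming equal process counts and contiguous ranks per node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Two nodes with 4 GPUs each: LOCAL_RANK repeats on every node,
# while the global rank is unique across the whole job.
layout = [
    (node, local, global_rank(node, 4, local))
    for node in range(2)
    for local in range(4)
]
for node, local, rank in layout:
    print(f"node={node} LOCAL_RANK={local} RANK={rank}")
```

Under this assumption, every node has a process with LOCAL_RANK 0 but only one process in the job has global rank 0, which is why the checkpoint guard checks the global rank.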

What to Notice

How to read this code

DDP does not shard model parameters; each rank has a full copy.
Gradient synchronization happens during backward, not in optimizer.step.
The global batch size is the per-rank batch size multiplied by the world size, so it grows with the number of ranks unless you reduce the per-rank batch.
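
The batch-size relationship in the last point is simple arithmetic. The sketch below uses hypothetical helper names and example numbers that do not come from the starter file.

```python
def global_batch_size(per_rank_batch: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all ranks."""
    return per_rank_batch * world_size


def per_rank_batch_for(target_global_batch: int, world_size: int) -> int:
    """Per-rank batch that keeps the global batch fixed as ranks are added."""
    if target_global_batch % world_size != 0:
        raise ValueError("target_global_batch must be divisible by world_size")
    return target_global_batch // world_size


print(global_batch_size(32, 8))    # 256
print(per_rank_batch_for(256, 8))  # 32
```

If a single-GPU script used a batch of 32 and is launched unchanged on 8 ranks, each optimizer step now sees 256 samples, which may call for revisiting the learning rate or the per-rank batch.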

Common Misunderstandings

What this code does not mean

“DDP automatically makes the model fit in memory.” DDP replicates the model, so it does not reduce per-GPU model memory.
“AllReduce is a separate explicit call in user code.” DDP hides it behind autograd hooks.
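
What the hidden synchronization step computes can be sketched without any GPU or PyTorch at all. The following is a pure-Python simulation under stated assumptions: a one-parameter model y = w * x with a mean-squared-error loss, equal-sized shards, and a hypothetical `all_reduce_mean` function standing in for the collective. It is not how DDP is implemented; it only illustrates why averaging per-rank gradients matches the full-batch gradient.

```python
import math


def local_gradient(w, shard):
    """d/dw of mean((w * x - y) ** 2) over the samples in one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Simulated all-reduce: sum across ranks, then divide by world size."""
    mean = sum(values) / len(values)
    return [mean] * len(values)  # every rank receives the same result


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
world_size = 2
shards = [data[rank::world_size] for rank in range(world_size)]

w = 0.5
local_grads = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(local_grads)
full_batch_grad = local_gradient(w, data)

print(local_grads)  # differs per rank before synchronization
print(synced)       # identical on every rank after synchronization
print(math.isclose(synced[0], full_batch_grad))
```

The per-rank gradients differ because each rank sees different data, but after the simulated all-reduce every rank holds the same value, so identical replicas stay identical after `optimizer.step()`.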

Interview Explanation

How to say it out loud

DDP launches one process per GPU. Each process joins a process group, builds the same model, processes a different data shard, and during backward DDP all-reduces gradient buckets so every rank applies the same update. Rank 0 usually owns checkpoint writing.
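
The phrase "processes a different data shard" can be made concrete with a strided split of sample indices. This is a simplified sketch with a hypothetical `shard_indices` helper; PyTorch's DistributedSampler additionally handles shuffling and pads or drops samples so that all ranks receive the same count.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Strided split: rank r takes indices r, r + world_size, r + 2 * world_size, ..."""
    return list(range(num_samples))[rank::world_size]


num_samples, world_size = 10, 4
shards = [shard_indices(num_samples, rank, world_size) for rank in range(world_size)]
for rank, indices in enumerate(shards):
    print(f"rank {rank}: {indices}")
```

In this unpadded version the shards are disjoint and together cover every sample, but ranks 2 and 3 receive one sample fewer than ranks 0 and 1; uneven shards like these are the reason a real sampler pads or drops.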

External intuition notes

Additional intuition

The DDP tutorial frames the important hidden behavior: backward hooks trigger gradient synchronization across processes. Official: PyTorch DDP tutorial
DDP communication hook docs make bucket overlap concrete: gradients are bucketized so communication can overlap backward computation. Official: PyTorch DDP communication hooks
The NCCL collectives docs supply the vocabulary: all-reduce, all-gather, and reduce-scatter are general communication patterns, not PyTorch-specific ideas. Official: NCCL collectives
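
The three collectives named above can be illustrated by their input and output shapes alone. The sketch below simulates ranks as a list of Python lists; the function names mirror the collective names but are plain illustrative functions, not NCCL or PyTorch calls, and `reduce_scatter` assumes the vector length is divisible by the number of ranks.

```python
def all_reduce(per_rank):
    """Every rank ends with the element-wise sum of all ranks' vectors."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends with only the r-th chunk of the element-wise sum."""
    world = len(per_rank)
    total = [sum(column) for column in zip(*per_rank)]
    chunk = len(total) // world  # assumes len(total) % world == 0
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(per_rank):
    """Every rank ends with the concatenation of all ranks' chunks."""
    gathered = [value for chunk in per_rank for value in chunk]
    return [list(gathered) for _ in per_rank]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # one gradient vector per rank
print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)) == all_reduce(grads))  # True
```

The last line shows the standard decomposition: a reduce-scatter followed by an all-gather produces the same result as an all-reduce.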

InfraLens