Lab 12: 64-GPU Parallelism Design
Annotated code reading lab. Running code is optional.
Distributed Training / Communication
Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
Read code to understand the concept
Core mechanism
- Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.
- Read this system as a pipeline graph: inputs become conditions, loaders instantiate model components, the scheduler loop updates latents, and offload or control paths change memory and latency.
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-12-64gpu-design/topology_design_worksheet.md. This preview explains the key idea; the linked starter file is the source of truth.
| Axis | Degree | Placement | Reason |
| --- | ---: | --- | --- |
| TP | | | |
| PP | | | |
| DP | | | |
| FSDP/ZeRO | | | |

| Communication | Frequency | Preferred link | Risk |
| --- | --- | --- | --- |
| TP collectives | per layer | | |
| PP send/recv | per stage boundary | | |
| DP all-reduce | per backward bucket | | |
| FSDP all-gather | per wrapped unit | | |

Key code blocks
- TP row: Forces you to justify whether layer-internal collectives stay inside a fast node domain.
- PP row: Separates model depth across stages and makes bubble analysis explicit.
- DP row: Represents replicated training paths and global batch growth.
- FSDP/ZeRO row: Adds state sharding to the parallelism design.
- Communication table: Maps each design choice to a collective and link preference.
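
To make the "Risk" column of the communication table less abstract, the sketch below estimates per-rank traffic for two of its rows. It assumes ring-style collectives and counts bandwidth only (no latency, no overlap with compute). The function names and the 1e9-parameter example are hypothetical and do not come from the starter file.

```python
def ring_allreduce_bytes(payload_bytes: float, ranks: int) -> float:
    """Bytes each rank sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (ranks - 1) / ranks * payload_bytes


def ring_shard_collective_bytes(payload_bytes: float, ranks: int) -> float:
    """Bytes each rank sends in one ring all-gather or one ring reduce-scatter."""
    return (ranks - 1) / ranks * payload_bytes


# Hypothetical example: 1e9 parameters in bf16 (2 bytes each), 8 data-parallel ranks.
grad_bytes = 1e9 * 2
dp = 8

ddp_step = ring_allreduce_bytes(grad_bytes, dp)
# Fully sharded step (assumed): all-gather params in forward, all-gather again
# in backward, then reduce-scatter gradients.
fsdp_step = 3 * ring_shard_collective_bytes(grad_bytes, dp)

print(f"DP all-reduce per step:    {ddp_step / 1e9:.2f} GB per rank")
print(f"FSDP collectives per step: {fsdp_step / 1e9:.2f} GB per rank")
print(f"ratio: {fsdp_step / ddp_step:.2f}x")
```

Under these assumptions full sharding moves about 1.5x the bytes of a plain gradient all-reduce, which is the kind of entry the "Risk" column is asking for. Real numbers depend on the collective algorithm, bucket sizes, and how much communication overlaps with compute.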
How to read this code
- A numeric TP/PP/DP tuple is not enough; placement matters.
- The pipeline bubble grows with the number of stages and shrinks as the number of microbatches per step increases.
- FSDP/ZeRO save memory but add all-gather/reduce-scatter paths.
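
The bubble point above can be checked with a small calculation. This is a minimal sketch assuming a non-interleaved GPipe or 1F1B-style schedule with equal-cost stages, where the idle fraction of a step is (p - 1) / (m + p - 1) for p stages and m microbatches. The function name is hypothetical.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of one pipeline step, assuming equal-cost stages
    and a non-interleaved schedule."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


# Fix PP degree at 4 and vary the number of microbatches per step.
for m in (4, 8, 32, 64):
    print(f"PP=4, microbatches={m}: bubble = {bubble_fraction(4, m):.1%}")
```

The takeaway for the worksheet: a PP degree is only defensible together with a microbatch count, because the same degree can waste a large or a small share of the step depending on m.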
What this code does not mean
- “64 GPUs means DP=64.” That ignores model memory, topology and layer-level communication.
- “PP across nodes is free.” PP avoids putting per-layer collectives on slow links, but it introduces bubbles and send/recv dependencies between stages.
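
The memory side of the "DP=64" misconception can be sketched with rough arithmetic. The example assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes for parameters, 2 for gradients, 12 for optimizer state per parameter) and ignores activations, buffers, and fragmentation. The 13e9-parameter model and the function name are hypothetical.

```python
def state_gib_per_gpu(n_params: float, shard_degree: int, stage: int) -> float:
    """Model-state memory per GPU in GiB.

    stage 0: everything replicated (plain DP)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    Assumes 2 + 2 + 12 bytes per parameter (mixed-precision Adam).
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= shard_degree
    if stage >= 2:
        grads /= shard_degree
    if stage >= 3:
        params /= shard_degree
    return n_params * (params + grads + optim) / 2**30


n_params = 13e9  # hypothetical model size
for stage in range(4):
    gib = state_gib_per_gpu(n_params, shard_degree=64, stage=stage)
    print(f"stage {stage}: {gib:8.1f} GiB of model state per GPU")
```

With everything replicated, the model state alone is far larger than a single accelerator's memory in this example, so DP=64 is not an option without sharding or model parallelism.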
How to say it out loud
For 8 nodes x 8 GPUs, I first decide whether TP can stay within the fastest available interconnect domain for frequent layer collectives. Then I use PP for model depth, DP for replica throughput, and FSDP/ZeRO-style sharding for state memory. I explain the communication map and expected bottleneck rather than only giving degrees.
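
The placement reasoning above can be checked mechanically. The sketch below enumerates (TP, PP, DP) tuples for 8 nodes x 8 GPUs and verifies that TP groups stay inside one node. It assumes a rank layout in which TP varies fastest, then DP, then PP. That layout, the helper names, and the example tuple are illustrative assumptions, not the starter file's method.

```python
GPUS_PER_NODE = 8
NODES = 8
WORLD_SIZE = GPUS_PER_NODE * NODES  # 64


def candidate_tuples(world_size=WORLD_SIZE, gpus_per_node=GPUS_PER_NODE):
    """List (tp, pp, dp) with tp * pp * dp == world_size and tp fitting in one node."""
    tuples = []
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp or world_size % tp:
            continue
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                tuples.append((tp, pp, rest // pp))
    return tuples


def coords(rank, tp, dp):
    """Map a global rank to (tp_rank, dp_rank, pp_rank); TP varies fastest (assumed layout)."""
    return rank % tp, (rank // tp) % dp, rank // (tp * dp)


def tp_groups_stay_on_node(tp, pp, dp, gpus_per_node=GPUS_PER_NODE):
    """True if every TP group's ranks live on a single node."""
    groups = {}
    for rank in range(tp * pp * dp):
        _, dp_rank, pp_rank = coords(rank, tp, dp)
        groups.setdefault((dp_rank, pp_rank), set()).add(rank // gpus_per_node)
    return all(len(nodes) == 1 for nodes in groups.values())


tp, pp, dp = 8, 2, 4  # one hypothetical design point, not a recommendation
print(len(candidate_tuples()), "tuples satisfy the constraints")
print("TP groups on one node:", tp_groups_stay_on_node(tp, pp, dp))
print("rank 13 ->", coords(13, tp, dp))
```

Enumerating the tuples shows why the degrees alone do not settle the design: many tuples multiply to 64, and choosing among them depends on the rank layout, memory, and the communication map.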
Additional intuition
- Megatron-LM is the source-code anchor for tensor and pipeline parallel training patterns, but the lab only needs the placement intuition. Source Code: Megatron-LM repository
- DeepSpeed pipeline docs make micro-batches and pipeline bubbles concrete; this is the right mental model before discussing PP degree. Official: DeepSpeed pipeline tutorial
- The Megatron-LM scaling paper is useful for the interview angle: explain tradeoffs among tensor, pipeline and data parallelism instead of naming a single best tuple. Paper: Efficient Large-Scale Language Model Training on GPU Clusters
