Lab 12 - 64-GPU Parallelism Design

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Distributed Training / Communication

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Concept Goal

Read the worksheet to understand how TP, PP, DP, and FSDP/ZeRO degrees are chosen and placed on a 64-GPU cluster, and which collective each choice puts on the critical path.

Mental Model

Core mechanism

Read the design as a placement map: each parallelism axis (TP, PP, DP, FSDP/ZeRO) gets a degree, a position in the cluster topology, and a collective that runs at a known frequency. For each axis, state the problem it solves, the resource tradeoff, the common failure mode, and the measurement that would validate the choice.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README topology_design_worksheet.md Optional design notes

Annotated Code Preview

Starter Preview

Excerpt from code/lab-12-64gpu-design/topology_design_worksheet.md. This preview explains the key idea; the linked starter file is the source of truth.

| Axis | Degree | Placement | Reason |
| --- | ---: | --- | --- |
| TP | | | |
| PP | | | |
| DP | | | |
| FSDP/ZeRO | | | |

| Communication | Frequency | Preferred link | Risk |
| --- | --- | --- | --- |
| TP collectives | per layer | | |
| PP send/recv | per stage boundary | | |
| DP all-reduce | per backward bucket | | |
| FSDP all-gather | per wrapped unit | | |

Line-by-line Explanation

Key code blocks

TP row: Forces you to justify whether layer-internal collectives stay inside a fast node domain.
PP row: Separates model depth across stages and makes bubble analysis explicit.
DP row: Represents replicated model copies; each added replica raises throughput and grows the global batch.
FSDP/ZeRO row: Adds state sharding to the parallelism design.
Communication table: Maps each design choice to a collective and link preference.
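
To make the "Preferred link" column easier to reason about, the sketch below estimates how many bytes each rank sends for the two collectives that dominate the table. It assumes a ring algorithm (real libraries may pick tree or hierarchical algorithms instead), and the function names are hypothetical helpers, not part of any library.

```python
def ring_allreduce_bytes_per_rank(payload_bytes: float, group_size: int) -> float:
    """Bytes each rank sends in a ring all-reduce of `payload_bytes`.

    Assumption: ring algorithm, i.e. a reduce-scatter followed by an
    all-gather, each moving (N - 1) / N of the payload per rank.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return 2.0 * (group_size - 1) / group_size * payload_bytes


def ring_allgather_bytes_per_rank(full_bytes: float, group_size: int) -> float:
    """Bytes each rank sends in a ring all-gather that rebuilds `full_bytes`."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return (group_size - 1) / group_size * full_bytes


if __name__ == "__main__":
    grad_bytes = 2 * 1e9  # hypothetical: 1e9 parameters with 2-byte gradients
    for n in (2, 8, 64):
        sent = ring_allreduce_bytes_per_rank(grad_bytes, n)
        print(f"DP={n:>2}: each rank sends about {sent / 1e9:.2f} GB per all-reduce")
```

The per-rank volume approaches twice the payload as the group grows, so it is the frequency column (per layer versus per backward bucket) that decides which collective most needs the fastest link.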

What to Notice

How to read this code

A numeric TP/PP/DP tuple is not enough; placement matters.
The pipeline bubble grows with the number of stages and shrinks as the number of microbatches increases.
FSDP/ZeRO saves memory but adds all-gather and reduce-scatter paths.
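
The dependence of the bubble on stages and microbatches can be sketched with the usual idealized estimate. It assumes a non-interleaved schedule (GPipe-style or 1F1B) with equal per-stage compute time and no communication cost; the function name is a hypothetical helper.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of a training step that a pipeline stage spends idle.

    Assumption: non-interleaved schedule with equal stage times, where a
    step takes (m + p - 1) slots and (p - 1) of them are bubble.
    """
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (p - 1) / (m + p - 1)


if __name__ == "__main__":
    for p in (2, 4, 8):
        for m in (4, 16, 64):
            frac = pipeline_bubble_fraction(p, m)
            print(f"PP={p}, microbatches={m:>2}: bubble = {frac:.1%}")
```

Under these assumptions, doubling the PP degree needs roughly twice as many microbatches to keep the same bubble, which ties the PP row of the worksheet to the global batch size set by the DP row.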

Common Misunderstandings

What this code does not mean

“64 GPUs means DP=64.” That ignores model memory, topology and layer-level communication.
“PP across nodes is free.” PP reduces some hot collectives but introduces bubbles and send/recv dependencies.
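
The first misunderstanding can be checked with rough arithmetic. The sketch below assumes mixed-precision Adam at about 16 bytes of model state per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and two optimizer moments). The 70e9-parameter model and the 80 GB device are hypothetical numbers chosen for illustration, and activations, buffers, and fragmentation are ignored.

```python
BYTES_PER_PARAM_MIXED_ADAM = 16  # assumption: 2 (weights) + 2 (grads) + 12 (fp32 optimizer state)


def model_state_gb_per_gpu(num_params: float, tp: int = 1, pp: int = 1,
                           shard_degree: int = 1) -> float:
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    tp and pp split the parameters across model-parallel ranks;
    shard_degree models full FSDP/ZeRO-3-style sharding across the
    data-parallel group. Activations are not included.
    """
    total_bytes = num_params * BYTES_PER_PARAM_MIXED_ADAM
    return total_bytes / (tp * pp * shard_degree) / 1e9


if __name__ == "__main__":
    params = 70e9     # hypothetical model size
    device_gb = 80.0  # hypothetical device memory
    designs = {
        "DP=64, no sharding": dict(tp=1, pp=1, shard_degree=1),
        "TP=8, PP=2, DP=4": dict(tp=8, pp=2, shard_degree=1),
        "DP=64 with full sharding": dict(tp=1, pp=1, shard_degree=64),
    }
    for name, kwargs in designs.items():
        gb = model_state_gb_per_gpu(params, **kwargs)
        verdict = "fits" if gb < device_gb else "does not fit"
        print(f"{name}: {gb:.1f} GB of model state per GPU ({verdict})")
```

Under these assumptions, pure DP=64 replicates all model state on every GPU and cannot fit, while the other two designs fit but pay for it with TP collectives, pipeline bubbles, or per-unit all-gathers.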

Interview Explanation

How to say it out loud

For 8 nodes x 8 GPUs, I first decide whether TP can stay within the fastest available interconnect domain for frequent layer collectives. Then I use PP for model depth, DP for replica throughput, and FSDP/ZeRO-style sharding for state memory. I explain the communication map and expected bottleneck rather than only giving degrees.
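
The placement argument can be sketched as a rank-to-coordinate mapping. The layout below is one hypothetical choice, not a specific framework's ordering: ranks are numbered contiguously within each node, TP is the fastest-varying axis, DP is next, and PP is outermost. The degrees TP=8, PP=2, DP=4 are illustrative only.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int,
                   gpus_per_node: int = 8) -> dict:
    """Map a global rank to (tp, dp, pp) coordinates and its node index.

    Assumption: TP varies fastest, then DP, then PP, and ranks are laid
    out contiguously per node.
    """
    world = tp * pp * dp
    if not 0 <= rank < world:
        raise ValueError(f"rank must be in [0, {world})")
    return {
        "tp_rank": rank % tp,
        "dp_rank": (rank // tp) % dp,
        "pp_rank": rank // (tp * dp),
        "node": rank // gpus_per_node,
    }


def tp_groups_stay_in_node(tp: int, pp: int, dp: int,
                           gpus_per_node: int = 8) -> bool:
    """True if every TP group (tp consecutive ranks) lives on a single node."""
    world = tp * pp * dp
    for start in range(0, world, tp):
        nodes = {r // gpus_per_node for r in range(start, start + tp)}
        if len(nodes) != 1:
            return False
    return True


if __name__ == "__main__":
    tp, pp, dp = 8, 2, 4  # hypothetical degrees; tp * pp * dp == 64
    print("TP groups intra-node:", tp_groups_stay_in_node(tp, pp, dp))
    for rank in (0, 7, 8, 32, 63):
        print(rank, rank_to_coords(rank, tp, pp, dp))
```

With this layout, each TP group fills exactly one 8-GPU node, so the per-layer collectives stay on the intra-node interconnect, while DP and PP traffic crosses nodes.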

External intuition notes

Additional intuition

Megatron-LM is the source-code anchor for tensor and pipeline parallel training patterns, but the lab only needs the placement intuition. Source Code: Megatron-LM repository
DeepSpeed pipeline docs make micro-batches and pipeline bubbles concrete; this is the right mental model before discussing PP degree. Official: DeepSpeed pipeline tutorial
The Megatron-LM scaling paper is useful for the interview angle: explain tradeoffs among tensor, pipeline, and data parallelism instead of naming a single best tuple. Paper: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
