InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 12: 64-GPU Parallelism Design

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

Distributed training scales beyond one device by partitioning data, model state, or computation across ranks. The key questions are what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.
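The partitioning question above can be made concrete by listing which (TP, PP, DP) degree tuples are even possible on 64 GPUs. The following is a minimal sketch, not part of the starter files: the function name `candidate_layouts` and the choice to cap TP at 8 GPUs (assuming one node is the fast interconnect domain) are illustrative assumptions.

```python
from itertools import product

WORLD_SIZE = 64      # total GPUs in this lab
GPUS_PER_NODE = 8    # assumption: 8 GPUs share the fast intra-node domain


def candidate_layouts(world_size, max_tp):
    """Yield (tp, pp, dp) tuples whose product equals world_size.

    Hypothetical helper: it only enforces the arithmetic constraint
    tp * pp * dp == world_size and an upper bound on TP.
    """
    divisors = [d for d in range(1, world_size + 1) if world_size % d == 0]
    for tp, pp in product(divisors, repeat=2):
        if tp > max_tp or world_size % (tp * pp) != 0:
            continue
        yield tp, pp, world_size // (tp * pp)


layouts = list(candidate_layouts(WORLD_SIZE, max_tp=GPUS_PER_NODE))
for tp, pp, dp in layouts:
    print(f"TP={tp:<2} PP={pp:<2} DP={dp}")
```

The list only shows which tuples are arithmetically valid. Choosing among them still requires the memory, placement, and communication reasoning that the rest of the lab covers.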

Mental Model

Core mechanism

  • Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.
  • Read this system as a pipeline graph: inputs become conditions, loaders instantiate model components, the scheduler loop updates latents, and offload or control paths change memory and latency.
  • For every design, state what is replicated, what is sharded, which collective runs on the critical path, and how optimizer semantics stay consistent.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-12-64gpu-design/topology_design_worksheet.md. This preview explains the key idea; the linked starter file is the source of truth.

| Axis | Degree | Placement | Reason |
| --- | ---: | --- | --- |
| TP | | | |
| PP | | | |
| DP | | | |
| FSDP/ZeRO | | | |

| Communication | Frequency | Preferred link | Risk |
| --- | --- | --- | --- |
| TP collectives | per layer | | |
| PP send/recv | per stage boundary | | |
| DP all-reduce | per backward bucket | | |
| FSDP all-gather | per wrapped unit | | |

Line-by-line Explanation

Key code blocks

  • TP row: forces you to justify whether layer-internal collectives stay inside a fast node domain.
  • PP row: separates model depth across stages and makes bubble analysis explicit.
  • DP row: represents replicated training paths and global batch growth.
  • FSDP/ZeRO row: adds state sharding to the parallelism design.
  • Communication table: maps each design choice to a collective and link preference.
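The placement question in the TP row can be checked mechanically once a rank ordering is fixed. The sketch below assumes global ranks are numbered consecutively within a node (so `rank // 8` is the node index) and that the TP index varies fastest, which is one common convention rather than the only one. The helper names are hypothetical.

```python
GPUS_PER_NODE = 8  # assumption: ranks 0-7 are node 0, ranks 8-15 are node 1, ...


def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank is outside the world size")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_groups(tp, pp, dp):
    """Return every TP group as a list of consecutive global ranks."""
    world = tp * pp * dp
    return [list(range(start, start + tp)) for start in range(0, world, tp)]


def crosses_node(group, gpus_per_node=GPUS_PER_NODE):
    """True if the ranks in the group live on more than one node."""
    return len({rank // gpus_per_node for rank in group}) > 1


for tp, pp, dp in [(8, 2, 4), (16, 1, 4)]:
    leaking = sum(crosses_node(g) for g in tp_groups(tp, pp, dp))
    print(f"TP={tp} PP={pp} DP={dp}: {leaking} TP groups span nodes")
```

With this ordering, TP=8 keeps every TP group inside one node, while TP=16 forces every TP group across a node boundary. A different rank ordering would change the answer, which is why the worksheet asks for placement and not only degrees.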
What to Notice

How to read this code

  • A numeric TP/PP/DP tuple is not enough; placement matters.
  • The pipeline bubble grows with the number of stages and shrinks as the number of microbatches increases.
  • FSDP/ZeRO save memory but add all-gather/reduce-scatter paths.
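The dependence of the pipeline bubble on stages and microbatches can be estimated with a short calculation. This sketch assumes a GPipe-style or non-interleaved 1F1B schedule with equal-cost stages, where the idle fraction is (p - 1) / (m + p - 1) for p stages and m microbatches. Interleaved schedules and uneven stages behave differently. The function name is an illustrative choice.

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a pipeline step under a simple schedule model.

    Assumes equal-cost stages and a GPipe-style or non-interleaved
    1F1B schedule: (p - 1) / (m + p - 1).
    """
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be positive")
    return (stages - 1) / (microbatches + stages - 1)


for stages in (2, 4, 8):
    for microbatches in (4, 16, 64):
        frac = bubble_fraction(stages, microbatches)
        print(f"PP={stages} microbatches={microbatches}: bubble ~ {frac:.1%}")
```

Raising PP without raising the microbatch count increases idle time, so the PP degree and the microbatch count have to be chosen together.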
Common Misunderstandings

What this code does not mean

  • “64 GPUs means DP=64.” That ignores model memory, topology and layer-level communication.
  • “PP across nodes is free.” PP reduces some hot collectives but introduces bubbles and send/recv dependencies.
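The memory side of the "DP=64" misunderstanding can be illustrated with a rough estimate of model-state memory per GPU. The sketch assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state, following the accounting in the ZeRO paper. It ignores activations, buffers, and fragmentation, and the 7.5B-parameter model is a hypothetical example.

```python
BYTES_PARAMS = 2   # fp16/bf16 weights
BYTES_GRADS = 2    # fp16/bf16 gradients
BYTES_OPTIM = 12   # fp32 master weights plus Adam momentum and variance


def state_bytes_per_gpu(num_params, shard_degree, stage):
    """Estimate model-state bytes per GPU for ZeRO-style sharding.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    n = shard_degree
    params = BYTES_PARAMS / (n if stage >= 3 else 1)
    grads = BYTES_GRADS / (n if stage >= 2 else 1)
    optim = BYTES_OPTIM / (n if stage >= 1 else 1)
    return num_params * (params + grads + optim)


NUM_PARAMS = 7.5e9  # hypothetical model size
for stage in range(4):
    gb = state_bytes_per_gpu(NUM_PARAMS, shard_degree=64, stage=stage) / 1e9
    print(f"stage {stage}: ~{gb:.1f} GB of model state per GPU")
```

Under these assumptions, pure data parallelism needs the full 16 bytes per parameter on every GPU regardless of how many replicas exist, which is why replica count alone does not make a large model fit.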
Interview Explanation

How to say it out loud

For 8 nodes x 8 GPUs, I first decide whether TP can stay within the fastest available interconnect domain for frequent layer collectives. Then I use PP for model depth, DP for replica throughput, and FSDP/ZeRO-style sharding for state memory. I explain the communication map and expected bottleneck rather than only giving degrees.
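To back the "expected bottleneck" part of that answer with a number, a bandwidth-only estimate of one collective is often enough. The sketch below uses the standard ring all-reduce result that each rank sends about 2(N - 1)/N times the message size. It ignores latency, overlap with compute, and hierarchical algorithms, and the gradient size and link bandwidths are hypothetical placeholders rather than measurements.

```python
def ring_allreduce_bytes_per_rank(message_bytes, group_size):
    """Bytes each rank sends in a ring all-reduce: 2 * (N - 1) / N * S."""
    if group_size < 1:
        raise ValueError("group_size must be positive")
    return 2 * (group_size - 1) / group_size * message_bytes


def estimated_seconds(message_bytes, group_size, link_bytes_per_s):
    """Bandwidth-only lower bound; latency and overlap are ignored."""
    sent = ring_allreduce_bytes_per_rank(message_bytes, group_size)
    return sent / link_bytes_per_s


GRAD_BYTES = 2e9  # hypothetical: 1B parameters of 2-byte gradients per DP rank
for label, bandwidth in [("fast intra-node link", 300e9),
                         ("slower inter-node link", 25e9)]:
    seconds = estimated_seconds(GRAD_BYTES, group_size=8, link_bytes_per_s=bandwidth)
    print(f"DP all-reduce over {label}: ~{seconds * 1e3:.0f} ms")
```

Comparing this estimate against the compute time per step indicates whether a given collective is likely to sit on the critical path. A profiler trace is still needed to confirm it.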
