# 64-GPU Topology Design Worksheet

This worksheet is a design artifact meant to be read the way you would read
code. You can fill it in later; first, read each table as a prompt for mapping
model-parallel concepts to communication paths. The goal is to explain why each
parallelism degree belongs on a fast intra-node link or can tolerate a slower
inter-node link.

## Cluster topology

- Nodes:
- GPUs per node:
- Intra-node GPU interconnect:
- Inter-node network:
- GPU/NIC affinity notes:

## Model assumptions

| Field | Value |
| --- | --- |
| Parameter count | |
| Layers | |
| Hidden size | |
| Sequence length | |
| Microbatch size | |
| Target global batch | |
| Precision | |

## Memory estimate

| Component | Estimate per GPU | Notes |
| --- | ---: | --- |
| Parameters | | |
| Gradients | | |
| Optimizer states | | |
| Activations | | |
| Communication buffers | | |
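
The first three rows of this table can be roughed out before any profiling. The sketch below is a minimal estimate, not a measurement. It assumes mixed-precision Adam (2 bytes per parameter, 2 bytes per gradient, and 12 bytes of optimizer state for an fp32 master copy plus two moments) and assumes all three components are sharded evenly across `shard_degree` GPUs. The function name, its parameters, and the 7e9 example size are hypothetical; activations and communication buffers are deliberately left out because they depend on the schedule and framework.

```python
GIB = 1024 ** 3


def training_state_gib(num_params, shard_degree=1,
                       bytes_param=2, bytes_grad=2, bytes_optim=12):
    """Rough per-GPU training-state memory in GiB.

    Assumes mixed-precision Adam and even sharding of all three components.
    Activations and communication buffers are NOT included.
    """
    if shard_degree < 1:
        raise ValueError("shard_degree must be at least 1")
    bytes_per_param = {
        "parameters": bytes_param,
        "gradients": bytes_grad,
        "optimizer_states": bytes_optim,
    }
    estimate = {
        name: num_params * nbytes / shard_degree / GIB
        for name, nbytes in bytes_per_param.items()
    }
    estimate["total"] = sum(estimate.values())
    return estimate


# Hypothetical example: a 7e9-parameter model sharded across 8 GPUs.
for name, gib in training_state_gib(7e9, shard_degree=8).items():
    print(f"{name:>16}: {gib:8.2f} GiB")
```

If only optimizer states are sharded, or if TP and PP also divide the parameters, adjust the divisor per component rather than reusing a single `shard_degree`.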

## Parallelism choice

| Axis | Degree | Placement | Reason |
| --- | ---: | --- | --- |
| TP | | | Usually kept inside a node because its collectives run inside every layer and sit on the critical path |
| PP | | | Can split depth across nodes but introduces a pipeline bubble |
| DP | | | Replicates the model and multiplies the global batch by the DP degree |
| FSDP/ZeRO | | | Shards training state (parameters, gradients, optimizer states) and adds all-gather/reduce-scatter traffic |
| SP/CP | | | Splits sequence/context work across GPUs to relieve long-context activation pressure |
| EP/MoE | | | Routes tokens to experts placed on different GPUs and can introduce all-to-all |
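
Before filling in the degrees, it can help to check that they are consistent with the cluster. The sketch below assumes the common convention that TP, PP, and DP degrees multiply to the total GPU count, with FSDP/ZeRO sharding over the DP group rather than adding its own factor. It ignores SP/CP and EP, whose relation to the world size depends on the framework. The function names are hypothetical, and the 8-GPUs-per-node value in the example is an assumption, since the worksheet leaves that field blank.

```python
def check_layout(world_size, gpus_per_node, tp, pp, dp):
    """Return a list of problems with a TP/PP/DP layout (empty if none found)."""
    problems = []
    if tp * pp * dp != world_size:
        problems.append(
            f"tp*pp*dp = {tp * pp * dp}, expected world size {world_size}"
        )
    # Keeping every TP group inside one node requires tp to divide gpus_per_node.
    if tp > gpus_per_node or gpus_per_node % tp != 0:
        problems.append(
            f"tp={tp} cannot be packed into nodes of {gpus_per_node} GPUs"
        )
    return problems


def global_batch(microbatch_size, num_microbatches, dp):
    """Samples per optimizer step: each DP replica runs all of its microbatches."""
    return microbatch_size * num_microbatches * dp


# Hypothetical example: 64 GPUs, assumed 8 per node.
print(check_layout(world_size=64, gpus_per_node=8, tp=8, pp=2, dp=4))
print(global_batch(microbatch_size=2, num_microbatches=8, dp=4))
```

An empty list only means the arithmetic is consistent; it says nothing about whether the placement is fast.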

## Communication map

| Communication | Frequency | Preferred link | Risk |
| --- | --- | --- | --- |
| TP collectives | per layer | fast intra-node if possible | stalls every layer if placed on slow links |
| PP send/recv | per microbatch at each stage boundary | can cross nodes if balanced | bubble and stage imbalance |
| DP all-reduce | per backward bucket | overlap-friendly fabric | late buckets can extend step time |
| FSDP all-gather | per wrapped unit | near compute if frequent | parameter materialization spikes |
| FSDP reduce-scatter | per wrapped unit in backward | bandwidth-sensitive | delayed gradient release |
| MoE all-to-all | per MoE layer | topology-aware groups | token routing imbalance |
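
To compare rows in this table, a bandwidth-only lower bound is often enough for a first pass. The sketch below uses the standard ring-algorithm volumes: each rank sends `2 * (n - 1) / n` times the message size for an all-reduce and `(n - 1) / n` times the full (gathered or pre-scatter) size for an all-gather or reduce-scatter. It ignores latency, overlap with compute, and any non-ring algorithm a library may choose, so treat the result as an optimistic floor. The function names and the example numbers are hypothetical.

```python
# Bytes sent per rank, as a multiple of (n - 1) / n * message size,
# under the ring-algorithm assumption.
RING_FACTOR = {"all_reduce": 2.0, "all_gather": 1.0, "reduce_scatter": 1.0}


def bytes_sent_per_rank(collective, message_bytes, group_size):
    """Ring-algorithm traffic per rank; message_bytes is the full tensor size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return RING_FACTOR[collective] * (group_size - 1) / group_size * message_bytes


def min_transfer_seconds(collective, message_bytes, group_size, link_gbytes_per_s):
    """Bandwidth-only lower bound; latency and overlap are ignored."""
    sent = bytes_sent_per_rank(collective, message_bytes, group_size)
    return sent / (link_gbytes_per_s * 1e9)


# Hypothetical example: the same 1 GB all-reduce on a fast and a slow link.
for label, bandwidth in [("fast link", 300.0), ("slow link", 25.0)]:
    seconds = min_transfer_seconds("all_reduce", 1e9, 8, bandwidth)
    print(f"{label}: {seconds * 1e3:.2f} ms")
```

Multiplying the per-call bound by the frequency column gives a rough per-step cost, which is usually what decides the preferred link.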

## Pipeline bubble and microbatching

- Number of pipeline stages:
- Number of microbatches:
- Expected bubble risk:
- Stage imbalance risk:
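
The first two fields above determine a simple estimate for the third. For a non-interleaved schedule (GPipe-style or 1F1B) with `p` equally balanced stages and `m` microbatches, the idle fraction of a step is `(p - 1) / (m + p - 1)`. The sketch below assumes equal stage times and no interleaving, so real imbalance makes the bubble worse than this figure. The function name and the example values are hypothetical.

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a step for a non-interleaved pipeline schedule.

    Assumes every stage takes the same time per microbatch.
    """
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be at least 1")
    return (stages - 1) / (microbatches + stages - 1)


# Hypothetical example: 4 stages with an increasing number of microbatches.
for m in (4, 8, 16, 32):
    print(f"stages=4 microbatches={m}: bubble = {bubble_fraction(4, m):.1%}")
```

The bubble shrinks as the microbatch count grows, but more microbatches per step also raise the global batch unless the microbatch size is reduced to compensate.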

## Checkpoint strategy

- Full or sharded checkpoint:
- Save frequency:
- Resume complexity:
- Conversion needs:

## Prompts for final memo

- Why is TP placed where it is?
- What is the hottest collective?
- How does the global batch change with the chosen DP degree and microbatch count?
- What is the first bottleneck you expect?
- What is the fallback design if communication dominates?