Lab 06: Expert Parallel Routing
Annotated code reading lab. Running code is optional.
This lab maps directly to the corresponding handbook section. Read that section first, then use this lab page and the starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Read MoE routing as token dispatch to expert-owning ranks.
Mechanism to keep in mind
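- `router_scores` select the top-k experts for each token.
- `dispatch` often uses an AllToAll-style exchange to send tokens to the ranks that own their experts, but the exact collective depends on the MoE implementation and runtime.
- `combine` restores the original token order after expert compute; with top-k greater than 1, a token's expert outputs are typically merged using the router weights.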
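The three steps above can be sketched in a single process, without any real collective. This is a minimal illustration, not the starter file's implementation: the helper names (`route_top_k`, `dispatch`, `combine`), the scalar "tokens", the toy experts, and the renormalization of top-k weights are all assumptions made for readability.

```python
import math

def softmax(row):
    """Numerically stable softmax over one token's router scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, top_k):
    """Per token, return [(expert_id, weight), ...] for its top-k experts."""
    routes = []
    for row in router_scores:
        probs = softmax(row)
        best = sorted(range(len(probs)), key=lambda e: -probs[e])[:top_k]
        norm = sum(probs[e] for e in best)  # renormalize over the chosen experts (assumption)
        routes.append([(e, probs[e] / norm) for e in best])
    return routes

def dispatch(routes, num_experts):
    """Group (token_index, weight) pairs by the expert that will process them."""
    buckets = [[] for _ in range(num_experts)]
    for tok, choices in enumerate(routes):
        for expert_id, weight in choices:
            buckets[expert_id].append((tok, weight))
    return buckets

def combine(buckets, tokens, experts):
    """Run each expert on its bucket and add weighted outputs back at the original token index."""
    out = [0.0] * len(tokens)
    for expert_id, bucket in enumerate(buckets):
        for tok, weight in bucket:
            out[tok] += weight * experts[expert_id](tokens[tok])
    return out

# Toy data: scalar "tokens" and experts that just scale their input (hypothetical).
tokens = [1.0, 2.0, 3.0]
router_scores = [
    [0.1, 2.0, 0.3, 1.5],
    [3.0, 0.2, 0.1, 0.4],
    [0.0, 1.0, 0.5, 2.5],
]
experts = [lambda x, s=s: x * s for s in (10.0, 20.0, 30.0, 40.0)]

routes = route_top_k(router_scores, top_k=2)
buckets = dispatch(routes, num_experts=4)
combined = combine(buckets, tokens, experts)
print(routes)
print(combined)
```

Carrying the token index alongside each dispatched item is what lets `combine` put results back in order; real implementations usually keep an index or permutation tensor for the same purpose.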
Annotated Code Preview
Excerpt from code/lab-06-expert-parallel-routing/expert_routing.py. The linked starter file is the source of truth.
# Expert Parallel Routing
# Annotated reading material. Running this file is optional.
# Source-of-truth focus: Read MoE routing as token dispatch to expert-owning ranks.
tokens = ["tok0", "tok1", "tok2"]
router_choice = {"tok0": "expert_3", "tok1": "expert_1", "tok2": "expert_3"}
dispatched = "all_to_all_style_exchange(tokens_grouped_by_expert)"
expert_outputs = "experts(dispatched)"
combined = "restore_original_token_order(expert_outputs)"
# What to explain while reading:
# - router_scores choose top-k experts.
# - dispatch often uses AllToAll-style exchange, but exact collectives vary.
# - combine restores token order after expert compute.
#
# Common traps:
# - MoE is not free model scaling.
# - Expert parallelism is not just tensor parallelism.
What each block is doing
- Setup / contract: `router_scores` choose top-k experts.
- Main transition: `dispatch` often uses AllToAll-style exchange, but exact collectives vary by implementation.
- Interview hook: `combine` restores token order after expert compute.
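To make "dispatch to expert-owning ranks" concrete, the sketch below simulates an AllToAll-style exchange with plain Python lists. No real communication library is used: `all_to_all` is a hypothetical stand-in for a collective, the two-rank layout is invented, and the rule that expert `e` lives on rank `e // experts_per_rank` is an assumption. Routing is top-1, matching the starter excerpt.

```python
def all_to_all(send_buffers):
    """Simulated AllToAll: recv[dst][src] is whatever src placed in send[src][dst]."""
    world = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(world)] for dst in range(world)]

world_size = 2
experts_per_rank = 2  # assumption: expert e is owned by rank e // experts_per_rank

# Each rank starts with its own tokens and a top-1 expert choice per token.
local_tokens = [["r0_t0", "r0_t1", "r0_t2"], ["r1_t0", "r1_t1"]]
local_choice = [[3, 1, 3], [0, 2]]

# Dispatch: bucket each token by the rank that owns its expert,
# remembering its original position so order can be restored later.
send = [[[] for _ in range(world_size)] for _ in range(world_size)]
for rank in range(world_size):
    for pos, (tok, expert_id) in enumerate(zip(local_tokens[rank], local_choice[rank])):
        owner = expert_id // experts_per_rank
        send[rank][owner].append((pos, expert_id, tok))

recv = all_to_all(send)

# Expert compute on the owning rank; results stay grouped by source rank.
processed = [
    [[(pos, f"expert_{e}({tok})") for pos, e, tok in chunk] for chunk in recv[rank]]
    for rank in range(world_size)
]

# A second exchange returns each result to the rank the token came from.
returned = all_to_all(processed)

# Combine: write every result back at its original position.
combined = []
for rank in range(world_size):
    out = [None] * len(local_tokens[rank])
    for chunk in returned[rank]:
        for pos, value in chunk:
            out[pos] = value
    combined.append(out)

print(combined)
```

The two exchanges are the routing communication that expert parallelism adds around expert compute. Which collective a real framework actually issues varies by implementation and runtime.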
Reading checkpoints
- Expert load balance matters.
- EP increases parameter capacity but adds routing communication.
- Capacity factors and dropped tokens are implementation details to check.
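The interaction between load balance, capacity factor, and dropped tokens can be seen in a few lines. This sketch assumes a commonly used convention, `capacity = ceil(capacity_factor * num_tokens / num_experts)`, and a first-come-first-served drop policy; both are assumptions, and real implementations may compute capacity differently, reroute overflow, or never drop tokens. The function name `apply_capacity` is hypothetical.

```python
import math
from collections import Counter

def apply_capacity(expert_choice, num_experts, capacity_factor):
    """Top-1 routing with a per-expert capacity; overflow tokens are dropped."""
    capacity = math.ceil(capacity_factor * len(expert_choice) / num_experts)
    load = Counter()
    kept, dropped = [], []
    for tok, expert_id in enumerate(expert_choice):
        if load[expert_id] < capacity:
            load[expert_id] += 1
            kept.append(tok)
        else:
            dropped.append(tok)  # expert is full: this token skips expert compute
    return capacity, dict(load), kept, dropped

# Skewed routing: expert 3 is chosen by 5 of 8 tokens, expert 2 by none.
expert_choice = [3, 1, 3, 3, 0, 3, 1, 3]

capacity, load, kept, dropped = apply_capacity(expert_choice, num_experts=4, capacity_factor=1.0)
print(capacity, load, dropped)

capacity2, load2, kept2, dropped2 = apply_capacity(expert_choice, num_experts=4, capacity_factor=2.0)
print(capacity2, load2, dropped2)
```

With a skewed router, a capacity factor of 1.0 drops three tokens here, while 2.0 drops one at the cost of larger per-expert buffers. That tradeoff is why balanced expert load matters.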
What this lab prevents
- MoE is not free model scaling.
- Expert parallelism is not just tensor parallelism.
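A back-of-the-envelope calculation shows why MoE scaling is not free. The numbers below are purely illustrative, and the cost model is a simplifying assumption: each expert is a two-matrix feed-forward block without biases, the router's own parameters are ignored, and the traffic figure is an upper bound in which every routed activation crosses ranks once for dispatch and once for combine. The function name `moe_layer_costs` is hypothetical.

```python
def moe_layer_costs(d_model, d_ff, num_experts, top_k, tokens, bytes_per_elem=2):
    """Rough per-layer counts for a MoE feed-forward block (illustrative model only)."""
    params_per_expert = 2 * d_model * d_ff           # up- and down-projection, no biases (assumption)
    total_expert_params = num_experts * params_per_expert
    active_params_per_token = top_k * params_per_expert
    # Upper bound: one d_model-sized activation per (token, expert) pair, sent out and back.
    routed_bytes = 2 * tokens * top_k * d_model * bytes_per_elem
    return {
        "total_expert_params": total_expert_params,
        "active_params_per_token": active_params_per_token,
        "routed_bytes_upper_bound": routed_bytes,
    }

costs = moe_layer_costs(d_model=1024, d_ff=4096, num_experts=8, top_k=2, tokens=4096)
for name, value in costs.items():
    print(f"{name}: {value:,}")
```

Parameter capacity grows with `num_experts`, per-token compute grows with `top_k`, and routed traffic grows with `tokens * top_k`. The extra capacity is paid for in memory and communication rather than per-token FLOPs.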
How to say it out loud
Read MoE routing as token dispatch to expert-owning ranks. Then explain the code by naming the state being transformed, the axis or shape that matters, and the tradeoff that would appear in a real system.
Additional intuition
- Use official docs and papers for API behavior and factual claims; use blogs only to improve the mental picture.
- If support matrices, performance behavior or backend choices are version-sensitive, check current docs before repeating them.
- A strong interview answer names the state object, the shape or axis it changes, and the tradeoff it creates.
