InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 06: Expert Parallel Routing

Annotated code reading lab. Running code is optional.

Related handbook section

Expert Parallel Routing

This lab maps directly to the handbook section of the same name. Read that section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Concept Goal

Expert Parallel Routing

Read MoE routing as token dispatch to expert-owning ranks.

Mental Model

Mechanism to keep in mind

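  • `router_scores` rank the experts for each token; the top-k experts by score receive that token.
  • `dispatch` sends each token to the rank that owns its chosen expert. It often uses an AllToAll-style exchange, but the exact collective depends on the MoE implementation and runtime.
  • `combine` returns expert outputs to the rank that sent the token and restores the original token order after expert compute.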
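The first step above, top-k selection from router scores, can be sketched in plain Python. This is a minimal illustration, not the starter file's code: the score values, the `softmax` and `top_k_experts` helpers, and the choice to renormalize the selected weights so they sum to 1 are all assumptions. Real routers differ on where softmax is applied and whether weights are renormalized.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's expert scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(logits, k):
    """Return (expert_id, weight) pairs for the k highest-scoring experts.

    Assumption: weights are renormalized over the chosen experts.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]

# Hypothetical router scores: 3 tokens x 4 experts.
router_scores = [
    [0.1, 2.0, -1.0, 1.5],
    [3.0, 0.2, 0.1, -0.5],
    [0.0, 0.3, 2.5, 2.4],
]

routes = [top_k_experts(row, k=2) for row in router_scores]
for token_index, route in enumerate(routes):
    print(token_index, [(expert, round(weight, 3)) for expert, weight in route])
```

With k=2, each token ends up with two destination experts and two mixing weights. Those weights are what a combine step would use to blend the two expert outputs.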

Annotated Code Preview

Starter preview

Excerpt from code/lab-06-expert-parallel-routing/expert_routing.py. The linked starter file is the source of truth.

# Expert Parallel Routing
# Annotated reading material. Running this file is optional.
# Source-of-truth focus: Read MoE routing as token dispatch to expert-owning ranks.

tokens = ["tok0", "tok1", "tok2"]
router_choice = {"tok0": "expert_3", "tok1": "expert_1", "tok2": "expert_3"}
dispatched = "all_to_all_style_exchange(tokens_grouped_by_expert)"
expert_outputs = "experts(dispatched)"
combined = "restore_original_token_order(expert_outputs)"

# What to explain while reading:
# - router_scores choose top-k experts.
# - dispatch often uses AllToAll-style exchange, but exact collectives vary.
# - combine restores token order after expert compute.
#
# Common traps:
# - MoE is not free model scaling.
# - Expert parallelism is not just tensor parallelism.
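
The preview keeps `dispatched`, `expert_outputs`, and `combined` as placeholder strings. A runnable single-process sketch of the same flow might look like the following. It simulates ranks with dictionaries and performs no real collective. The expert-to-rank mapping (`EXPERTS_PER_RANK`, `owner_rank`) and the `run_experts` stand-in are hypothetical and are not taken from the starter file.

```python
from collections import defaultdict

tokens = ["tok0", "tok1", "tok2"]
router_choice = {"tok0": "expert_3", "tok1": "expert_1", "tok2": "expert_3"}

# Assumption: 4 experts spread over 2 ranks, 2 experts per rank.
EXPERTS_PER_RANK = 2

def owner_rank(expert_name):
    """Map 'expert_3' -> rank 1 under the assumed contiguous placement."""
    expert_id = int(expert_name.split("_")[1])
    return expert_id // EXPERTS_PER_RANK

def dispatch(tokens, router_choice):
    """Group (original_position, token, expert) by destination rank."""
    send_buffers = defaultdict(list)
    for position, token in enumerate(tokens):
        expert = router_choice[token]
        send_buffers[owner_rank(expert)].append((position, token, expert))
    return dict(send_buffers)

def run_experts(recv_buffers):
    """Stand-in for expert compute: tag each token with its expert."""
    return {
        rank: [(position, f"{expert}({token})") for position, token, expert in items]
        for rank, items in recv_buffers.items()
    }

def combine(expert_outputs, num_tokens):
    """Scatter outputs back to their original token positions."""
    combined = [None] * num_tokens
    for items in expert_outputs.values():
        for position, output in items:
            combined[position] = output
    return combined

dispatched = dispatch(tokens, router_choice)
expert_outputs = run_experts(dispatched)
combined = combine(expert_outputs, len(tokens))
print(dispatched)
print(combined)
```

The detail to notice is that the original position travels with each token. Without it, the combine step cannot restore token order once tokens have been regrouped by expert.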

Line-by-line Explanation

What each block is doing

Setup / contract: `router_scores` choose top-k experts.

Main transition: `dispatch` often uses AllToAll-style exchange, but exact collectives vary by implementation.

Interview hook: `combine` restores token order after expert compute.

What to Notice

Reading checkpoints

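  • Expert load balance matters: when routing is skewed, the ranks that own popular experts do more work and the other ranks wait for them.
  • EP increases parameter capacity but adds routing communication.
  • Capacity factors and dropped tokens are implementation details to check: some implementations cap the tokens each expert accepts and drop or reroute the overflow, while others avoid dropping.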
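
To make the capacity checkpoint concrete, here is a small sketch of one common capacity rule, `ceil(capacity_factor * num_tokens * top_k / num_experts)`, applied to a skewed top-1 assignment. The formula variant, the first-come drop policy, and the sample `assignments` are assumptions for illustration. Check the implementation you are reading for its actual rule.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Max tokens one expert accepts under the assumed capacity rule."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def apply_capacity(assignments, num_experts, capacity_factor):
    """Keep tokens in arrival order until an expert is full; drop the rest.

    assignments: one expert id per token (top-1 routing).
    """
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = Counter()
    kept, dropped = [], []
    for token_index, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_index)
        else:
            dropped.append(token_index)
    return kept, dropped, capacity

# Hypothetical skewed routing: 5 of 8 tokens choose expert 3.
assignments = [3, 1, 3, 3, 0, 3, 2, 3]
kept, dropped, capacity = apply_capacity(assignments, num_experts=4, capacity_factor=1.25)
print("capacity per expert:", capacity)
print("kept:", kept)
print("dropped:", dropped)
```

Raising `capacity_factor` reduces drops but enlarges the padded buffers that each expert must allocate and exchange, which is the tradeoff to name when reading a real implementation.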

Common Misunderstandings

What this lab prevents

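  • MoE is not free model scaling: per-token compute grows with top-k rather than with the number of experts, but every expert's weights must still be stored, and routing adds communication and load-balancing cost.
  • Expert parallelism is not just tensor parallelism: tensor parallelism splits each weight matrix across ranks so every rank works on every token, while expert parallelism places whole experts on different ranks and moves tokens to them.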
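
A rough back-of-the-envelope calculation shows why the first point holds. The sketch below assumes a two-matrix feed-forward expert with no biases, ignores router weights and attention, and treats every routed token copy as leaving its rank, so the communication figure is an upper bound. The function name and all sizes are hypothetical.

```python
def moe_layer_costs(d_model, d_ff, num_experts, top_k, tokens_per_rank, bytes_per_elem=2):
    """Estimate stored vs. active parameters and routing traffic for one MoE layer."""
    # Assumption: each expert is an up projection plus a down projection.
    params_per_expert = 2 * d_model * d_ff
    total_params = num_experts * params_per_expert
    active_params_per_token = top_k * params_per_expert
    # Each routed copy is sent once in dispatch and once back in combine.
    exchange_bytes_per_rank = 2 * tokens_per_rank * top_k * d_model * bytes_per_elem
    return {
        "total_params": total_params,
        "active_params_per_token": active_params_per_token,
        "exchange_bytes_per_rank": exchange_bytes_per_rank,
    }

costs = moe_layer_costs(
    d_model=1024, d_ff=4096, num_experts=8, top_k=2, tokens_per_rank=4096
)
for name, value in costs.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the layer stores four times the parameters that any single token uses, and each rank may exchange tens of megabytes per layer per step. The capacity gain is real, but it is paid for in memory and communication.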

Interview Explanation

How to say it out loud

Read MoE routing as token dispatch to expert-owning ranks. Then explain the code by naming the state being transformed, the axis or shape that matters, and the tradeoff that would appear in a real system.

External intuition notes

Additional intuition

  • Use official docs and papers for API behavior and factual claims; use blogs only to improve the mental picture.
  • If support matrices, performance behavior or backend choices are version-sensitive, check current docs before repeating them.
  • A strong interview answer names the state object, the shape or axis it changes, and the tradeoff it creates.

Further Reading

Official, paper and practical references