InfraLens

Sequence / Context Parallel

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

Reading focus

Read sequence/context parallelism as sharding long-context buffers along the token axis.
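To get a rough sense of scale, the arithmetic for one buffer can be sketched as below. This is a back-of-the-envelope sketch, not a memory model: it assumes a single dense activation buffer of shape [batch, seq_len, hidden], 2 bytes per element (for example bf16), and a sequence length that divides evenly across ranks. The helper name `buffer_bytes` and all sizes are hypothetical, chosen only for illustration.

```python
def buffer_bytes(batch, seq_len, hidden, bytes_per_elem=2, sp_size=1):
    """Bytes held per rank for one [batch, seq_len, hidden] buffer
    when the token axis is split evenly over sp_size ranks.
    Hypothetical helper; ignores padding, replicas, and temporaries."""
    if seq_len % sp_size != 0:
        raise ValueError("this sketch assumes seq_len is divisible by sp_size")
    return batch * (seq_len // sp_size) * hidden * bytes_per_elem


# Illustrative sizes only: one sequence of 131072 tokens, hidden size 4096.
unsharded = buffer_bytes(batch=1, seq_len=131072, hidden=4096)
per_rank = buffer_bytes(batch=1, seq_len=131072, hidden=4096, sp_size=8)

print(unsharded // 2**20, "MiB unsharded")
print(per_rank // 2**20, "MiB per rank with 8-way sequence sharding")
```

Only the token axis shrinks here; the hidden dimension and the model weights are untouched, which is what separates this from tensor or pipeline parallelism.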

Annotated sketch

## Sequence/context parallel checklist

Which tensors are sharded over sequence?
Which operation needs full context or a collective?
Where are labels/loss gathered or reduced?
Which attention backend do the current docs of your framework require?

What to explain

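sequence_shard reduces the per-rank token length: with N ranks, each rank holds roughly seq_len / N tokens.
Attention may require all-to-all or all-gather patterns, because a query on one rank can depend on keys and values held by other ranks.
The loss often needs gathered or reduced results, because each rank typically computes it only for its own tokens.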
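The three points above can be walked through in a single-process simulation. This is a minimal sketch under stated assumptions, not a framework implementation: no real collectives are used, `shard`, `all_gather`, and `attend` are hypothetical stand-ins, features are scalars, the sequence divides evenly across ranks, and causal masking is omitted for brevity.

```python
import math


def shard(seq, world_size):
    """Split a list into contiguous, equal chunks along the token axis."""
    if len(seq) % world_size != 0:
        raise ValueError("this sketch assumes seq_len is divisible by world_size")
    chunk = len(seq) // world_size
    return [seq[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(shards):
    """Stand-in for an all-gather: concatenate every rank's shard in rank order."""
    return [item for s in shards for item in s]


def attend(queries, keys, values):
    """Non-causal softmax attention over scalar features, one output per query."""
    outputs = []
    for q in queries:
        scores = [q * k for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, values)) / norm)
    return outputs


world_size = 4
q = [0.1, 0.4, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2]
k = [0.3, -0.1, 0.2, 0.0, 0.4, 0.1, -0.3, 0.5]
v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# 1. Sharding cuts the per-rank token length from 8 to 2.
q_shards, k_shards, v_shards = (shard(x, world_size) for x in (q, k, v))
per_rank_len = [len(s) for s in q_shards]

# 2. Each rank keeps only its own queries but needs every key and value,
#    so K and V are gathered before the local attention call.
full_k, full_v = all_gather(k_shards), all_gather(v_shards)
sharded_out = all_gather([attend(qs, full_k, full_v) for qs in q_shards])
reference_out = attend(q, k, v)  # unsharded baseline for comparison

# 3. Loss: reduce per-rank sums and token counts, then divide once.
token_losses = [0.5, 1.5, 1.0, 2.0, 0.0, 1.0, 3.0, 1.0]
loss_shards = shard(token_losses, world_size)
loss_sum = sum(sum(s) for s in loss_shards)      # stands in for all-reduce(sum)
token_count = sum(len(s) for s in loss_shards)   # stands in for all-reduce(sum)
mean_loss = loss_sum / token_count

print(per_rank_len, mean_loss)
```

Reducing sums and counts separately, rather than averaging per-rank means, keeps the result correct when ranks hold different numbers of valid tokens, for example when some labels are ignored.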

Common trap

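SP/CP is not pipeline parallelism: it splits the token axis of the sequence, whereas pipeline parallelism splits the model's layers into stages.
Sharding the sequence does not remove causal semantics: each token must still attend only to itself and earlier positions in the full sequence, including positions held by other ranks.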
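The second trap can be made concrete with a small mask check. The sketch below assumes contiguous, equal-sized shards assigned in rank order; the helper names are hypothetical, and real implementations may use other layouts (for example, load-balanced orderings) with different offset arithmetic. The point it illustrates is that the mask on each rank must be built from global token positions, not local ones.

```python
def full_causal_mask(seq_len):
    """Reference mask: query q may see key k only when k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def causal_mask_for_rank(rank, world_size, seq_len):
    """Mask rows for the queries owned by one rank (hypothetical helper).
    Row i is the local query at global position start + i;
    columns are global key positions."""
    chunk = seq_len // world_size
    start = rank * chunk
    return [[k <= start + i for k in range(seq_len)] for i in range(chunk)]


def naive_local_mask(rank, world_size, seq_len):
    """The trap: treating the local index as if it were the global position."""
    chunk = seq_len // world_size
    return [[k <= i for k in range(seq_len)] for i in range(chunk)]


seq_len, world_size = 8, 4

# Stitching the per-rank rows back together reproduces the full causal mask.
stitched = []
for rank in range(world_size):
    stitched.extend(causal_mask_for_rank(rank, world_size, seq_len))

# Rank 1 owns global positions 2 and 3, so its first query sees keys 0, 1, 2.
correct_visible = sum(causal_mask_for_rank(1, world_size, seq_len)[0])
naive_visible = sum(naive_local_mask(1, world_size, seq_len)[0])

print(stitched == full_causal_mask(seq_len), correct_visible, naive_visible)
```

With the naive mask, the first query on rank 1 would see a single key instead of three, silently dropping context that lives on rank 0.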
