InfraLens

Sequence / Context Parallel

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

Reading focus

Read sequence/context parallelism as sharding long-context buffers along the token axis.
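
To make "sharding along the token axis" concrete, here is a minimal single-process sketch in NumPy. It is not a framework API: `shard_sequence` and `world_size` are hypothetical names used only for illustration, and the list of shards stands in for what each rank would hold in a real job.

```python
import numpy as np

def shard_sequence(x, world_size):
    """Split a [batch, seq, hidden] array into equal chunks along the token axis.

    Hypothetical helper for illustration; each returned chunk plays the role
    of the buffer one rank would keep.
    """
    batch, seq, hidden = x.shape
    if seq % world_size != 0:
        # Assumption: sequences are padded so they divide evenly across ranks.
        raise ValueError("sequence length must be divisible by world_size")
    return np.split(x, world_size, axis=1)

batch, seq, hidden, world_size = 2, 8, 4, 4
x = np.arange(batch * seq * hidden, dtype=np.float32).reshape(batch, seq, hidden)

shards = shard_sequence(x, world_size)

# Each rank holds seq / world_size tokens; batch and hidden are untouched.
print([s.shape for s in shards])

# Concatenating the shards along the token axis recovers the full buffer.
print(np.array_equal(np.concatenate(shards, axis=1), x))
```

The point of the sketch is only the shape bookkeeping: per-rank activation memory scales with `seq / world_size`, while the batch and hidden dimensions are left alone.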

Annotated sketch

## Sequence/context parallel checklist

1. Which tensors are sharded over sequence?
2. Which operation needs full context or a collective?
3. Where are labels/loss gathered or reduced?
4. Which attention backend do the framework's current docs require for this mode?
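
The first three checklist questions can be walked through in a small single-process simulation. This is a sketch under stated assumptions, not any framework's implementation: ranks are simulated as list entries, the all-gather is simulated with `np.concatenate`, the all-reduce with a Python `sum`, and the attention is a single head with a causal mask. All names (`causal_attention`, `q_offset`, `world_size`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, q_offset=0):
    """Single-head causal attention.

    q_offset is the global position of the first query token, so a rank that
    holds only a slice of the queries still applies the correct causal mask.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    q_pos = q_offset + np.arange(q.shape[0])[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq, hidden, world_size = 8, 4, 4
chunk = seq // world_size
q, k, v = (rng.standard_normal((seq, hidden)) for _ in range(3))
w = rng.standard_normal((hidden, hidden))

# 1. Sharded over sequence: each simulated rank holds `chunk` tokens of q, k, v.
q_shards = np.split(q, world_size)
k_shards = np.split(k, world_size)
v_shards = np.split(v, world_size)

# 2a. Token-wise ops (here a linear projection) need no communication.
local_proj = np.concatenate([qs @ w for qs in q_shards])
local_matches = np.allclose(local_proj, q @ w)

# 2b. Attention mixes tokens, so each rank needs keys/values from other ranks.
#     Simulated all-gather of K and V; queries stay sharded.
k_full = np.concatenate(k_shards)
v_full = np.concatenate(v_shards)
out_shards = [
    causal_attention(q_shards[r], k_full, v_full, q_offset=r * chunk)
    for r in range(world_size)
]
attention_matches = np.allclose(np.concatenate(out_shards), causal_attention(q, k, v))

# 3. Loss: each rank sums its valid per-token losses and counts valid tokens;
#    both numbers are reduced across ranks before dividing.
per_token_loss = rng.random(seq)
valid = np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=bool)  # last token is padding
loss_shards = np.split(per_token_loss, world_size)
valid_shards = np.split(valid, world_size)
local_sums = [l[m].sum() for l, m in zip(loss_shards, valid_shards)]
local_counts = [int(m.sum()) for m in valid_shards]
global_loss = sum(local_sums) / sum(local_counts)  # simulated all-reduce

print(local_matches, attention_matches)
print(np.isclose(global_loss, per_token_loss[valid].mean()))
```

Reducing the sum and the count separately matters because ranks can hold different numbers of valid tokens; averaging per-rank means would weight the ranks equally instead of weighting the tokens equally. Gathering full K and V is the simplest way to give attention its full context; real implementations may use other communication patterns for the same step.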

What to explain

Common trap