This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.
Read sequence/context parallelism as sharding long-context buffers along the token axis: each rank holds a slice of the sequence dimension of the activations instead of a full copy.
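
As a minimal sketch of that reading, the snippet below splits a `[batch, seq, hidden]` buffer along the token axis across simulated ranks. It assumes NumPy and no real distributed runtime; `shard_sequence`, `world_size`, and the shapes are illustrative names and values, not part of the lab code.

```python
import numpy as np

def shard_sequence(x, world_size, axis=1):
    """Split x into equal contiguous chunks along the token axis, one per simulated rank."""
    if x.shape[axis] % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    return np.split(x, world_size, axis=axis)

# Hypothetical sizes chosen only for illustration.
batch, seq_len, hidden, world_size = 2, 16, 8, 4
activations = np.arange(batch * seq_len * hidden, dtype=np.float32).reshape(batch, seq_len, hidden)

shards = shard_sequence(activations, world_size)
for rank, shard in enumerate(shards):
    # Each rank holds seq_len / world_size tokens, so 1 / world_size of the buffer.
    print(f"rank {rank}: shape={shard.shape}, bytes={shard.nbytes}")

# Concatenating along the token axis recovers the original buffer.
restored = np.concatenate(shards, axis=1)
```

Only the token axis shrinks per rank; the batch and hidden dimensions are unchanged.
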
## Sequence/context parallel checklist
1. Which tensors are sharded over sequence?
2. Which operation needs full context or a collective?
3. Where are labels/loss gathered or reduced?
4. Which attention backend do the current docs require?
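
The first three questions can be traced in a small simulation. This is a sketch under stated assumptions, not how any particular framework implements it: ranks are simulated with Python lists, the "all-gather" is a plain `np.concatenate`, the "all-reduce" is a Python `sum`, attention is single-head and bidirectional (no causal mask), and every name here is hypothetical. It does not address question 4, which depends on the framework documentation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head, bidirectional scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, dim, world_size = 16, 8, 4  # hypothetical sizes
q, k, v = (rng.standard_normal((seq_len, dim)) for _ in range(3))
per_token_loss = rng.random(seq_len)
valid = np.ones(seq_len)
valid[:3] = 0.0  # ignored label positions, unevenly spread across ranks

# 1. Tensors sharded over sequence: q, k, v, the per-token loss, and the label mask.
q_shards = np.split(q, world_size)
k_shards = np.split(k, world_size)
v_shards = np.split(v, world_size)
loss_shards = np.split(per_token_loss, world_size)
valid_shards = np.split(valid, world_size)

# 2. Attention needs full context: each rank's queries attend to every key/value,
#    so K and V are gathered across ranks (simulated all-gather).
k_full = np.concatenate(k_shards)
v_full = np.concatenate(v_shards)
out_shards = [attention(q_r, k_full, v_full) for q_r in q_shards]

# 3. Loss: reduce the masked sums and the valid-token counts separately, then divide.
#    Averaging per-rank means would be wrong when ranks hold different numbers of valid tokens.
loss_sum = sum((l * m).sum() for l, m in zip(loss_shards, valid_shards))  # simulated all-reduce
token_count = sum(m.sum() for m in valid_shards)                          # simulated all-reduce
sharded_loss = loss_sum / token_count

# Reference: the same computation without sharding.
full_out = attention(q, k, v)
full_loss = (per_token_loss * valid).sum() / valid.sum()
sharded_out = np.concatenate(out_shards)
```

In this sketch the per-token operations stay local to each shard; only the key/value gather and the two loss reductions cross rank boundaries.
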