# Sequence / Context Parallel

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

## Reading focus

Read sequence/context parallelism as sharding long-context buffers along the token axis: each rank stores only a slice of the tokens rather than the full sequence.
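As a minimal sketch of that reading, the NumPy snippet below splits one activation buffer along its token axis and then reassembles it. It is a single-process illustration, not a framework API: the `shard_sequence` helper, the `[batch, seq, hidden]` layout, and the even-split requirement are all assumptions made for this example.

```python
import numpy as np

def shard_sequence(x, world_size, axis=1):
    """Split x into world_size equal chunks along the token axis.

    Hypothetical helper: real SP/CP implementations may pad or use other layouts.
    """
    seq_len = x.shape[axis]
    if seq_len % world_size != 0:
        raise ValueError("sequence length must divide evenly across ranks")
    return np.split(x, world_size, axis=axis)

# Assumed layout: [batch, seq, hidden].
batch, seq_len, hidden = 2, 8, 4
x = np.arange(batch * seq_len * hidden, dtype=np.float32).reshape(batch, seq_len, hidden)

# One list entry per simulated rank.
shards = shard_sequence(x, world_size=4)
shard_shapes = [s.shape for s in shards]

# Concatenating along the same axis recovers the full buffer (what an all-gather would do).
restored = np.concatenate(shards, axis=1)
```

Each simulated rank holds `seq_len / world_size` tokens, while the batch and hidden dimensions are untouched.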

## Annotated sketch

```text
## Sequence/context parallel checklist

1. Which tensors are sharded over sequence?
2. Which operation needs full context or a collective?
3. Where are labels/loss gathered or reduced?
4. Which attention backend is required by current docs?
```

## What to explain

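- `sequence_shard` reduces the per-rank token length: with N ranks, each rank holds roughly 1/N of the tokens.
- Attention mixes information across tokens, so it may require all-to-all or all-gather patterns to reach keys and values held on other ranks.
- The loss often needs gathered or reduced results, because each rank computes it only for its local tokens.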
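The loss point can be made concrete with a small single-process simulation. The sketch below assumes per-token losses and a padding mask are already available; `local_loss_stats` and `all_reduce_sum` are hypothetical stand-ins (the latter for a sum all-reduce), not calls from any library.

```python
import numpy as np

def local_loss_stats(token_losses, mask):
    """Return (sum of unmasked losses, count of unmasked tokens) for one rank."""
    return float((token_losses * mask).sum()), float(mask.sum())

def all_reduce_sum(values):
    """Stand-in for a sum all-reduce: every rank would receive this total."""
    return sum(values)

rng = np.random.default_rng(0)
seq_len, world_size = 8, 4
token_losses = rng.random(seq_len)
# Padding at the end of the sequence: the last ranks hold few or no real tokens.
mask = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=np.float64)

# Shard both buffers along the token axis, one chunk per simulated rank.
loss_shards = np.split(token_losses, world_size)
mask_shards = np.split(mask, world_size)

# Each rank computes local sums; the sums (not the means) are reduced.
stats = [local_loss_stats(l, m) for l, m in zip(loss_shards, mask_shards)]
total_loss = all_reduce_sum(s[0] for s in stats)
total_count = all_reduce_sum(s[1] for s in stats)
global_mean = total_loss / total_count

# Unsharded reference for comparison.
reference = (token_losses * mask).sum() / mask.sum()
```

Reducing the loss sum and the token count separately keeps the result equal to the unsharded mean. Averaging per-rank means instead would weight every rank equally, regardless of how many unmasked tokens it holds, and a rank with only padding would have no mean at all.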

## Common traps

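- SP/CP is not pipeline parallelism: it splits the token axis across ranks, whereas pipeline parallelism splits the model's layers into stages.
- Sharding the sequence does not remove causal semantics: a token may still attend only to tokens at or before its global position, including tokens held on other ranks.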
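The causal point can be checked with a toy single-head attention in NumPy. This is a sketch under stated assumptions, not a real backend: queries are sharded over simulated ranks, keys and values are reassembled with `np.concatenate` as a stand-in for an all-gather, and the `causal_attention` helper is hypothetical. The mask is built from global token positions; the last few lines show what goes wrong if a rank uses local positions instead.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v, q_positions, k_positions):
    """Single-head attention where a query sees only keys at or before its position."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    allowed = k_positions[None, :] <= q_positions[:, None]
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, dim, world_size = 8, 4, 4
q, k, v = (rng.standard_normal((seq_len, dim)) for _ in range(3))
positions = np.arange(seq_len)

# Unsharded reference.
reference = causal_attention(q, k, v, positions, positions)

# Shard queries, keys, values, and positions along the token axis.
q_shards = np.split(q, world_size)
k_shards = np.split(k, world_size)
v_shards = np.split(v, world_size)
pos_shards = np.split(positions, world_size)

# Stand-in for an all-gather of keys and values.
k_full = np.concatenate(k_shards)
v_full = np.concatenate(v_shards)

# Each rank attends with its local queries but keeps their global positions.
outputs = [
    causal_attention(q_r, k_full, v_full, pos_r, positions)
    for q_r, pos_r in zip(q_shards, pos_shards)
]
sharded = np.concatenate(outputs)

# The trap: masking with local positions (0..chunk-1) hides valid earlier tokens.
local_pos = np.arange(seq_len // world_size)
wrong = np.concatenate(
    [causal_attention(q_r, k_full, v_full, local_pos, positions) for q_r in q_shards]
)
```

With global positions the sharded output matches the unsharded reference; with local positions every rank after the first masks out tokens it should have attended to.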