InfraLens

Attention Backend Comparison Reading

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

Reading focus

Compare the math of attention with the kernel and runtime choices that execute it: SDPA, FlashAttention, and PagedAttention.

Annotated sketch

## Backend comparison

| Layer | Question | Reading focus |
| --- | --- | --- |
| Math | What is computed? | softmax(QK^T / sqrt(d_k) + mask) V |
| Kernel | How is it computed? | SDPA / FlashAttention: fusion, tiling, memory traffic |
| Runtime | How is state managed? | KV blocks, batching, paging |
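The Math row can be pinned down with a direct reference implementation. This is a minimal pure-Python sketch, assuming a single head, inputs given as lists of rows (one row per token), and an optional causal mask. The name `naive_attention` is illustrative and not from any library. It computes every score for a query before normalizing, which is the memory traffic a fused kernel is designed to avoid.

```python
import math


def softmax(row):
    # Subtract the row max for numerical stability; the result is unchanged.
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) == 0.0 for masked entries
    total = sum(exps)
    return [e / total for e in exps]


def naive_attention(q, k, v, causal=True):
    """Reference attention: softmax(Q K^T / sqrt(d_k) + mask) V (hypothetical helper)."""
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(v[0])
    out = []
    for i, qi in enumerate(q):
        # Row i of the scaled score matrix Q K^T / sqrt(d_k).
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(-math.inf)  # mask: query i may not attend to future key j
            else:
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
        weights = softmax(scores)
        # Output row i is the weighted sum of value rows.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d_v)])
    return out
```

With the causal mask, the first output row equals `v[0]`, because query 0 can only attend to key 0.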

What to explain

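Math names the formula: softmax(QK^T / sqrt(d_k) + mask) V.
Kernel names how that formula is executed on the hardware: fusion, tiling, and memory traffic.
Runtime names how serving-time state is managed: the KV cache, request batching, and paging.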
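To make "how the formula is executed" concrete at the kernel layer, the following sketch computes the same attention by walking the keys in tiles with an online (running) softmax, the reordering FlashAttention is built on. It is a hypothetical single-head, pure-Python illustration that assumes inputs are lists of rows; the function name and `block_size` parameter are illustrative. A real kernel fuses these loops on the GPU and keeps each tile in on-chip SRAM instead of writing the full score matrix to HBM.

```python
import math


def flash_style_attention(q, k, v, block_size=2, causal=True):
    """Tiled attention with an online softmax (illustrative, not a real kernel)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for i, qi in enumerate(q):
        m = -math.inf       # running max of the scores seen so far
        l = 0.0             # running softmax denominator, relative to m
        acc = [0.0] * d_v   # running numerator: sum of exp(s - m) * v_j
        for start in range(0, len(k), block_size):
            # Scores for one tile of keys (the part a kernel keeps on-chip).
            tile = []
            for j in range(start, min(start + block_size, len(k))):
                if causal and j > i:
                    continue  # masked key contributes nothing
                s = scale * sum(a * b for a, b in zip(qi, k[j]))
                tile.append((s, j))
            if not tile:
                continue
            m_new = max(m, max(s for s, _ in tile))
            # Rescale earlier partial sums to the new max; exp(-inf) == 0.0 on the first tile.
            correction = math.exp(m - m_new)
            l *= correction
            acc = [x * correction for x in acc]
            for s, j in tile:
                p = math.exp(s - m_new)
                l += p
                for c in range(d_v):
                    acc[c] += p * v[j][c]
            m = m_new
        # Normalize once at the end instead of after every tile.
        out.append([x / l for x in acc])
    return out
```

Because the partial sums are rescaled exactly whenever a tile raises the running max, the output matches a direct softmax(QK^T / sqrt(d_k) + mask) V computation up to floating-point rounding, for any `block_size`.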

Common trap

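Do not call FlashAttention an approximation: it computes exact attention (up to floating-point rounding) and only reorders the work into tiles so the full score matrix never has to be written to GPU memory.
Do not confuse the kernel backend, which computes attention for the batch it is given, with the serving scheduler, which decides which requests and KV blocks make up that batch.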
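The kernel/scheduler split is easiest to see at the runtime layer. The following toy, PagedAttention-style KV cache is a hypothetical sketch: the class and method names are illustrative and are not vLLM's API. The runtime allocates fixed-size physical blocks on demand and records them in a per-sequence block table; the attention kernel only reads keys and values through that table and never allocates anything itself.

```python
class PagedKVCache:
    """Toy paged KV cache: a pool of fixed-size physical blocks plus per-sequence block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))          # physical block pool
        self.storage = [[None] * block_size for _ in range(num_blocks)]
        self.block_tables = {}                              # seq_id -> [physical block ids]
        self.lengths = {}                                   # seq_id -> tokens stored

    def append(self, seq_id, kv):
        """Runtime side: store one token's K/V entry, allocating a block if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            # Last block is full (or the sequence is new): grab any free block.
            if not self.free_blocks:
                # A real serving runtime would preempt or queue a request here.
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        block_id = table[n // self.block_size]
        self.storage[block_id][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Kernel side: read entries in logical order by following the block table."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        bs = self.block_size
        return [self.storage[table[t // bs]][t % bs] for t in range(n)]

    def free(self, seq_id):
        """Runtime side: a finished sequence returns its blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

If sequence `"a"` appends tokens 0 and 1, sequence `"b"` appends one token, and `"a"` then appends three more, `"a"` ends up holding physical blocks `[0, 2, 3]` with block size 2, yet `gather("a")` still returns its entries in logical order. The kernel never needs contiguous memory; deciding who gets blocks is the runtime's job.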
