# Memory Optimization Reading

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

## Reading focus

Read each optimization knob by asking which tensor or piece of model state it moves, shrinks, or recomputes.
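One way to ask that question empirically is to record peak memory and wall time for the same work under two settings. The sketch below is a CPU-only stand-in that uses Python's `tracemalloc` in place of a GPU counter. On a CUDA device, `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()` play that role. The toy workload and the names `measure`, `sum_squares_full` and `sum_squares_chunked` are hypothetical and only illustrate the "shrink" case: the same result, computed with fewer intermediates alive at once.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Run fn(*args); return (result, peak traced bytes, elapsed seconds)."""
    tracemalloc.start()  # CPU stand-in for a GPU peak-memory counter
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak, elapsed


def sum_squares_full(n):
    squares = [i * i for i in range(n)]  # materializes every intermediate at once
    return sum(squares)


def sum_squares_chunked(n, chunk=1_000):
    total = 0
    for start in range(0, n, chunk):  # only `chunk` intermediates live at a time
        total += sum([i * i for i in range(start, min(start + chunk, n))])
    return total


n = 200_000
full, full_peak, full_s = measure(sum_squares_full, n)
chunked, chunked_peak, chunked_s = measure(sum_squares_chunked, n)
print(f"full:    peak {full_peak / 1024:.0f} KiB in {full_s * 1e3:.1f} ms")
print(f"chunked: peak {chunked_peak / 1024:.0f} KiB in {chunked_s * 1e3:.1f} ms")
```

Both versions return the same value. Only the peak changes, which is the kind of comparison worth repeating for each knob.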

## Annotated sketch

```text
## Memory optimization map

| Knob | Saves | Cost |
| --- | --- | --- |
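| attention backend | attention activation memory and memory traffic (IO) | backend constraints (hardware, dtype, shape support) |
| CPU offload | peak GPU memory | CPU-to-GPU transfer latency on every call |
| VAE tiling / slicing | peak memory during VAE decode | more decode calls; overlapping tiles repeat some work |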
```
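The "attention memory" row becomes concrete with back-of-envelope arithmetic on the attention score matrix (the input to softmax). This is a minimal sketch under stated assumptions: a hypothetical "naive" backend holds all `seq_len x seq_len` scores for every head at once, and a hypothetical chunked backend processes `query_chunk` query rows at a time. Fused kernels such as flash attention go further and never hold a full row block either. The batch size, head count and 64x64 latent are illustrative values, not a specific model's configuration.

```python
# Bytes held by attention scores at one moment, under two hypothetical backends.
MiB = 2**20


def score_matrix_bytes(batch, heads, seq_len, bytes_per_elem=2, query_chunk=None):
    """Scores live at once: batch * heads * rows * seq_len elements (fp16 by default).

    query_chunk=None  -> naive backend, all seq_len query rows materialized.
    query_chunk=k     -> chunked backend, only k query rows materialized at a time.
    """
    rows = seq_len if query_chunk is None else min(query_chunk, seq_len)
    return batch * heads * rows * seq_len * bytes_per_elem


# Illustrative sizes: a 64x64 latent flattened into 4096 tokens, 8 heads.
seq = 64 * 64
naive = score_matrix_bytes(batch=1, heads=8, seq_len=seq)
chunked = score_matrix_bytes(batch=1, heads=8, seq_len=seq, query_chunk=1024)
print(f"naive:   {naive / MiB:.0f} MiB")    # 256 MiB
print(f"chunked: {chunked / MiB:.0f} MiB")  # 64 MiB
```

The naive cost grows with the square of the token count, so doubling the latent's side length multiplies it by 16. That is why the backend choice matters most at high resolution.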

## What to explain

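- The attention backend changes how much activation memory attention needs: fused kernels such as flash attention compute softmax(QK^T)V block by block instead of materializing the full sequence-by-sequence score matrix.
- CPU offload keeps modules in CPU memory and moves each one to the GPU only while it runs, so peak GPU memory drops but every call pays for the transfers.
- VAE tiling decodes the latent in overlapping spatial tiles, and VAE slicing decodes a batch one image at a time; both lower the peak memory of the decode step.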
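The tiling/slicing trade-off can be checked with a toy model. This is a minimal sketch under a loud assumption: the activation memory of one decode call is proportional to the number of latent pixels it processes (`batch * tile_h * tile_w`). `tile_starts` and `decode_plan` are hypothetical helpers, not a library API. Real decoders, and the way a library blends tile seams, are more involved.

```python
# Toy model of VAE decode peak memory (hypothetical, not a real decoder).
# Assumption: memory in one decode call ~ latent pixels processed in that call.


def tile_starts(size, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, size)."""
    if tile >= size:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last tile sits flush with the far edge
    return starts


def decode_plan(batch, h, w, tile=None, overlap=0, sliced=False):
    """Return (latent pixels decoded per call, number of decode calls)."""
    # Slicing: decode one image at a time instead of the whole batch.
    per_call_batch = 1 if sliced else batch
    calls = batch if sliced else 1
    if tile is None:
        return per_call_batch * h * w, calls
    # Tiling: each call sees one spatial tile instead of the full latent.
    n_tiles = len(tile_starts(h, tile, overlap)) * len(tile_starts(w, tile, overlap))
    return per_call_batch * min(tile, h) * min(tile, w), calls * n_tiles


for label, kwargs in [
    ("full", {}),
    ("sliced", {"sliced": True}),
    ("tiled", {"tile": 64, "overlap": 16}),
    ("tiled+sliced", {"tile": 64, "overlap": 16, "sliced": True}),
]:
    peak, calls = decode_plan(batch=4, h=128, w=128, **kwargs)
    print(f"{label:>12}: {peak:>6} latent pixels per call, {calls:>2} calls")
```

Each knob cuts the per-call peak by 4x here, but the number of calls rises. With overlap, the tiled plan also decodes 9 x 64 x 64 = 36,864 latent pixels per image instead of 16,384, which is where the extra time comes from.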

## Common trap

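- Do not stack every optimization blindly: each knob has its own cost, so measure peak memory and latency after enabling each one.
- Do not treat offload as a speed improvement. It lowers peak GPU memory by adding CPU-to-GPU transfers, so it usually makes each call slower.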
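The offload trap shows up in a small steady-state cost model of one pipeline call. It loosely mirrors model-level offload, the behavior behind diffusers' `enable_model_cpu_offload()`, but it is a sketch, not that implementation. Assumptions: only weights are counted (activations ignored); module sizes and the 16 GiB/s host-to-device bandwidth are illustrative numbers, not measurements; without offload, weights stay resident between calls; with offload, every module starts on the CPU and is copied in when it runs, evicting the previously active module. `simulate` is a hypothetical helper.

```python
# Hypothetical cost model of one pipeline call under model-level CPU offload.
# Sizes and bandwidth below are illustrative assumptions, not measurements.


def simulate(schedule, sizes_gib, offload, bandwidth_gib_s=16.0):
    """Return (peak resident weights in GiB, host->device transfer seconds per call)."""
    if not offload:
        # Everything stays on the GPU between calls: high peak, no per-call copies.
        return sum(sizes_gib.values()), 0.0
    resident = None
    copied = 0.0
    for name in schedule:
        if name != resident:  # incoming module is copied in, previous one evicted
            copied += sizes_gib[name]
            resident = name
    peak = max(sizes_gib[name] for name in set(schedule))
    return peak, copied / bandwidth_gib_s


sizes = {"text_encoder": 0.5, "unet": 3.5, "vae": 0.25}  # hypothetical GiB
schedule = ["text_encoder"] + ["unet"] * 30 + ["vae"]    # 30 denoising steps

print(simulate(schedule, sizes, offload=False))  # (4.25, 0.0)
print(simulate(schedule, sizes, offload=True))   # (3.5, 0.265625)
```

In this model, offload never makes a call faster: it only adds transfer time. The peak falls to the largest single module, not to zero, so if the denoiser alone does not fit, model-level offload cannot help and another knob is needed.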