InfraLens

Memory Optimization Reading

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

Reading focus

Read each optimization knob by asking which tensor or model state it moves, shrinks, or recomputes.

Annotated sketch

## Memory optimization map

| Knob | Saves | Cost |
| --- | --- | --- |
| attention backend | attention activation memory and memory I/O | backend constraints |
| CPU offload | peak GPU memory | host-device transfer latency |
| VAE tiling | peak memory during decode | more scheduling overhead |
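
To make the map concrete without a framework call, here is a minimal back-of-envelope sketch in plain Python. It is not how any particular library accounts for memory. The function names, the component sizes, and the tensor shapes are all hypothetical, chosen only for illustration. The offload model assumes component-level offload with one component resident on the GPU at a time, and the tiling model ignores tile overlap.

```python
from typing import Mapping

MIB = 1024 * 1024


def attention_scores_bytes(batch: int, heads: int, seq_len: int,
                           bytes_per_el: int = 2) -> int:
    """Bytes used if the full (seq_len x seq_len) score matrix is
    materialized for every head. A fused attention backend can avoid
    holding this matrix in memory all at once."""
    return batch * heads * seq_len * seq_len * bytes_per_el


def weights_peak_bytes(component_bytes: Mapping[str, int], offload: bool) -> int:
    """Peak GPU bytes spent on weights.

    Without offload, every component is resident at once (sum).
    With offload, ASSUME only one component is resident at a time (max);
    the price is a host-device transfer each time a component is swapped in.
    """
    sizes = list(component_bytes.values())
    return max(sizes) if offload else sum(sizes)


def activation_bytes(height: int, width: int, channels: int,
                     bytes_per_el: int = 2) -> int:
    """Bytes for one activation map of the given spatial size."""
    return height * width * channels * bytes_per_el


if __name__ == "__main__":
    # Attention backend: what the naive score matrix would cost (fp16).
    scores = attention_scores_bytes(batch=1, heads=8, seq_len=4096)
    print(f"materialized attention scores: {scores / MIB:.0f} MiB")

    # CPU offload: hypothetical component sizes, in bytes.
    components = {"text_encoder": 500 * MIB, "denoiser": 5000 * MIB, "vae": 200 * MIB}
    print(f"weights, all resident: {weights_peak_bytes(components, False) / MIB:.0f} MiB")
    print(f"weights, offloaded:    {weights_peak_bytes(components, True) / MIB:.0f} MiB")

    # VAE tiling: one full-size map versus one tile (overlap ignored).
    full = activation_bytes(1024, 1024, channels=128)
    tile = activation_bytes(512, 512, channels=128)
    print(f"decode map, full: {full / MIB:.0f} MiB; per tile: {tile / MIB:.0f} MiB "
          f"({full // tile} tiles to schedule)")
```

Each function maps to one row of the table: the first shows a tensor that a backend can avoid materializing, the second shows state that is moved off the GPU, and the third shows a working set that shrinks at the price of more passes.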

What to explain

Common trap