InfraLens

Memory Optimization Reading

This starter is annotated reading material and the source of truth for the lab preview. Running anything is optional; the reading goal is to explain the mechanism without hiding behind a framework call.

Reading focus

Read each optimization knob by asking which tensor or model state it moves, shrinks, or recomputes.

Annotated sketch

## Memory optimization map

| Knob | Saves | Cost |
| --- | --- | --- |
| attention backend | attention memory/IO | backend constraints |
| CPU offload | peak GPU memory | transfer latency |
| VAE tiling | decode peak | more scheduling overhead |
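
To make the attention row of the map concrete, here is a back-of-envelope sketch of how much memory the attention score tensor alone can take. The shapes, the block size and the helper name `attention_score_bytes` are illustrative assumptions, not measurements from any particular model or backend. It is also a simplified model: real fused backends differ in how many blocks are live at once.

```python
def attention_score_bytes(batch, heads, q_len, k_len, bytes_per_elem=2):
    """Bytes needed to hold a (batch, heads, q_len, k_len) score tensor."""
    return batch * heads * q_len * k_len * bytes_per_elem


# Illustrative shapes only (assumptions): fp16 scores, 16 heads, 4096 tokens.
batch, heads, seq_len = 1, 16, 4096

# A naive backend materializes the full seq_len x seq_len score matrix.
naive = attention_score_bytes(batch, heads, seq_len, seq_len)

# A tiled backend is modeled here as keeping one block of scores live at a time.
block = 128  # hypothetical block size
tiled = attention_score_bytes(batch, heads, block, block)

print(f"naive scores: {naive / 2**20:.0f} MiB")  # 512 MiB
print(f"tiled scores: {tiled / 2**20:.2f} MiB")  # 0.50 MiB
```

The weights and the output tensor are the same in both cases. What changes is an intermediate tensor, which is why the backend shows up as an activation-memory knob and not as a model-size knob.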

What to explain

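The attention backend changes how much activation and attention memory the model needs.
Offload moves modules between CPU and GPU, so only part of the model occupies GPU memory at a time.
VAE tiling and slicing reduce the peak memory of the decode step.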
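
The tiling point can be sketched with simple arithmetic. The code below assumes that peak decode activation memory scales with the spatial area decoded at once, which is a simplification. The helper names, the tile size and the overlap are hypothetical values chosen for illustration.

```python
import math


def tile_grid(size, tile, overlap):
    """Number of overlapping tiles needed to cover `size` pixels on one axis."""
    if size <= tile:
        return 1
    stride = tile - overlap
    return math.ceil((size - overlap) / stride)


def decode_tradeoff(height, width, tile, overlap):
    """Return (tile count, peak area ratio, total work ratio) vs. a full decode."""
    rows = tile_grid(height, tile, overlap)
    cols = tile_grid(width, tile, overlap)
    tile_area = min(tile, height) * min(tile, width)
    full_area = height * width
    tiles = rows * cols
    return tiles, tile_area / full_area, tiles * tile_area / full_area


# Illustrative numbers (assumptions): 2048x2048 output, 512 px tiles, 64 px overlap.
tiles, peak_ratio, work_ratio = decode_tradeoff(2048, 2048, tile=512, overlap=64)
print(tiles, peak_ratio, work_ratio)  # 25 0.0625 1.5625
```

Under these assumptions the peak drops to one sixteenth of the full decode, while the overlapping tiles add roughly 56% more pixels to process across 25 separate passes.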

Common trap

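Do not stack every optimization blindly: each one has a cost, so enable them one at a time and measure.
Do not treat offload as a pure speed improvement: it lowers peak GPU memory at the cost of transfer latency.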
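
The offload trap can be made explicit with a small cost model. Everything here is an assumption for illustration: the module names and sizes, the transfer bandwidth, and the simplification that only weights count and that exactly one module is resident on the GPU at a time.

```python
GIB = 2**30

# Hypothetical module sizes (weights only); not taken from a real pipeline.
modules = {
    "text_encoder": 1 * GIB,
    "denoiser": 5 * GIB,
    "vae": 1 * GIB,
}


def resident_peak(module_bytes):
    """Peak GPU memory when every module stays on the GPU."""
    return sum(module_bytes.values())


def offloaded_peak(module_bytes):
    """Peak GPU memory when only one module is on the GPU at a time."""
    return max(module_bytes.values())


def transfer_seconds(module_bytes, bandwidth_bytes_per_s):
    """Time spent copying each module to the GPU once."""
    return sum(module_bytes.values()) / bandwidth_bytes_per_s


assumed_bandwidth = 16 * GIB  # bytes per second, an assumed host-to-device rate

print(resident_peak(modules) / GIB)   # 7.0
print(offloaded_peak(modules) / GIB)  # 5.0
print(transfer_seconds(modules, assumed_bandwidth))  # 0.4375
```

In this sketch offload saves 2 GiB of peak memory and adds about 0.44 s of copying per run. Whether that trade is worth it depends on whether the model would otherwise fit at all.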
