Why is inference slow?
Separate prefill, decode, KV cache reads, batching, and speculative decoding before blaming the model.
InfraLens helps you build a first map of memory, latency, parallelism, quantization, and token flow — with paths, formulas, calculators, and small code examples when you want to try things yourself.
Built for exploration, interview preparation, and technical research — not to cover everything, but to give you a clear place to start.
Each card gives you a short route instead of dropping you into a long index.
Break memory into weights, gradients, optimizer states, activations, and communication buffers.
Look at positional behavior, KV cache growth, attention cost, and serving latency as separate pressures.
Follow the cost from image tokens to latent grids, temporal patches, denoising steps, and control paths.
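As a first pass at the training-memory breakdown mentioned above, here is a minimal sketch that counts only the model states: weights, gradients, and optimizer states. It assumes mixed-precision training with Adam at the commonly cited 2 + 2 + 12 bytes per parameter (fp16 weights, fp16 gradients, and fp32 master weights plus two Adam moments). The function name and defaults are illustrative and not taken from InfraLens. Activations and communication buffers are left out because they depend on batch size, sequence length, and the parallelism setup.

```python
def training_memory_gib(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough model-state memory for one full replica, in GiB.

    Assumed defaults (mixed-precision Adam):
      weight_bytes=2      fp16/bf16 weights
      grad_bytes=2        fp16/bf16 gradients
      optimizer_bytes=12  fp32 master weights + Adam momentum + Adam variance
    Activations and communication buffers are NOT included.
    """
    gib = 1024 ** 3
    parts = {
        "weights": n_params * weight_bytes / gib,
        "gradients": n_params * grad_bytes / gib,
        "optimizer_states": n_params * optimizer_bytes / gib,
    }
    parts["total_model_states"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, no sharding or offloading.
    for name, value in training_memory_gib(7_000_000_000).items():
        print(f"{name:>20}: {value:8.1f} GiB")
```

Under these assumptions the optimizer states dominate, which is why sharding or offloading them is usually the first thing to look at when a model does not fit.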
Use these routes when you want a topic sequence rather than a concept lookup.
Look up a mechanism, formula, pitfall, or code example in context.
Browse concepts
Estimate KV cache, attention memory, video tokens, and quantized weights.
Try calculators
See how concepts connect when you know one term but not the surrounding system.
Open concept map
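If you want to try the KV cache and quantized-weight estimates yourself, the sketch below shows the arithmetic such a calculator typically performs. It is not the InfraLens implementation: the function names and parameters are hypothetical, the KV cache formula assumes a standard decoder that stores one key and one value vector per layer, KV head, and token, and the weight estimate ignores quantization overhead such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size in bytes.

    The leading factor 2 counts keys and values. bytes_per_elem=2 assumes
    fp16/bf16 cache entries. With grouped-query attention, num_kv_heads is
    smaller than the number of query heads.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def quantized_weight_bytes(n_params, bits_per_weight):
    """Weight storage in bytes, ignoring per-group scales and zero points."""
    return n_params * bits_per_weight // 8


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical model shape: 32 layers, 32 KV heads, head_dim 128.
    cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=4096)
    print(f"KV cache at 4096 tokens: {cache / gib:.2f} GiB")
    weights = quantized_weight_bytes(8_000_000_000, bits_per_weight=4)
    print(f"8B params at 4 bits: {weights / 1e9:.1f} GB")
```

The KV cache grows linearly with both sequence length and batch size, so doubling either doubles the estimate, while the weight footprint stays fixed.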