Tagged “#llm-inference”
-
A Token's Journey Across the LLM Stack
Part 4 of The Life of a Token Across the LLM Stack: a synthesis of token, compute, memory, and request lifecycles for diagnosing LLM inference systems.
-
The Life of a Request Inside vLLM
Part 3 of The Life of a Token Across the LLM Stack: a serving-level walkthrough of how vLLM-style engines schedule prefill, decode, continuous batching, KV cache, and streamed tokens.
-
Why Transformer Performance Is Mostly a Matmul Problem
Part 2 of The Life of a Token Across the LLM Stack: a kernel-level explanation of how Transformer linear layers become GPU matmul, tiling, memory movement, and tensor-core execution.
-
The Life of a Token Inside a Transformer
Part 1 of The Life of a Token Across the LLM Stack: a systems-oriented explanation of how tokens move through embeddings, attention, MLPs, positional encoding, KV cache, and logits.
-
LLM Inference Is a Full-Stack Systems Problem
Part 0 of The Life of a Token Across the LLM Stack: an entry-point map of LLM inference across Transformer computation, GPU kernels, KV cache, and serving systems.