Tagged '#llm-inference'

Reasoning Effort Is an Inference-Time Systems Knob

Reasoning effort changes hidden token use, latency, cost, context pressure, request residency, and serving capacity across the LLM inference stack.

22 Jul 2026 · 16 min read

AI Infrastructure Is Not Plumbing

Why AI infrastructure determines whether experiments are trustworthy, inference is economical, and AI products remain reliable under real workloads.

13 Jul 2026 · 22 min read

A Token's Journey Across the LLM Stack

A unified diagnostic model connecting token computation, GPU kernels, KV-cache memory, request scheduling, latency, and throughput across LLM inference.

06 Jun 2026 · Updated 13 Jul 2026 · 17 min read

The Life of a Request Inside vLLM

How vLLM-style serving engines coordinate prefill, decode, continuous batching, paged KV cache, scheduling, speculative decoding, and streaming.

04 Jun 2026 · Updated 13 Jul 2026 · 18 min read

Why Transformer Performance Is Mostly a Matmul Problem

How Transformer linear layers become tiled GPU matrix multiplications shaped by memory movement, tensor-core utilization, batch shape, and arithmetic intensity.

02 Jun 2026 · Updated 13 Jul 2026 · 17 min read

The Life of a Token Inside a Transformer

How token IDs become contextual representations through embeddings, normalization, attention, MLPs, positional encoding, KV cache, and logits.

31 May 2026 · Updated 13 Jul 2026 · 17 min read

LLM Inference Is a Full-Stack Systems Problem

A full-stack map of LLM inference across Transformer computation, GPU kernels, KV-cache memory, request scheduling, and streamed token delivery.

29 May 2026 · Updated 13 Jul 2026 · 16 min read
