Lab 10 - vLLM Serving Workload Config

Overview


Annotated code reading lab. Running code is optional.

Related handbook section

Inference Serving

Inference serving is easier to reason about when prefill, decode, batching, KV Cache capacity, and tail latency are treated as separate concerns. A useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.


Concept Goal

Read code to understand the concept

Mental Model

Core mechanism

A Transformer block turns token ids into vectors, mixes context with attention, applies per-token nonlinear transformations, and uses residual and normalization layers to keep deep training stable.
Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README, benchmark_config.yaml, serving_benchmark_plan.md

Annotated Code Preview

Starter Preview

Excerpt from code/lab-10-vllm-serving/benchmark_config.yaml. This preview explains the key idea; the linked starter file is the source of truth.


workloads:
  - name: "short-short"
    input_tokens: 128
    output_tokens: 64
    concurrency: [1, 4, 8, 16]
  - name: "long-short"
    input_tokens: 4096
    output_tokens: 64
    concurrency: [1, 4, 8]

metrics:
  - ttft_ms
  - tpot_ms
  - qps
  - tokens_per_second
  - memory_peak_gb
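
One way to read the workloads section is as a run matrix: each workload is measured once per concurrency level. The sketch below is a minimal illustration of that reading, not the lab's actual harness. It mirrors the excerpt as a plain Python dict instead of parsing the YAML file, and the names CONFIG, expand_runs, and max_live_tokens are hypothetical.

```python
# Plain-dict mirror of the workloads in the excerpt (a stand-in for loading
# benchmark_config.yaml with a YAML parser; the dict itself is an assumption).
CONFIG = {
    "workloads": [
        {"name": "short-short", "input_tokens": 128, "output_tokens": 64,
         "concurrency": [1, 4, 8, 16]},
        {"name": "long-short", "input_tokens": 4096, "output_tokens": 64,
         "concurrency": [1, 4, 8]},
    ],
}


def expand_runs(config):
    """Return one run description per (workload, concurrency) pair."""
    runs = []
    for workload in config["workloads"]:
        tokens_per_request = workload["input_tokens"] + workload["output_tokens"]
        for level in workload["concurrency"]:
            runs.append({
                "name": workload["name"],
                "input_tokens": workload["input_tokens"],
                "output_tokens": workload["output_tokens"],
                "concurrency": level,
                # Upper bound on tokens held in the KV cache at once, reached
                # only if every in-flight request is at its final token.
                "max_live_tokens": level * tokens_per_request,
            })
    return runs


RUNS = expand_runs(CONFIG)
for run in RUNS:
    print(run["name"], run["concurrency"], run["max_live_tokens"])
```

The matrix has seven runs. The long-short workload at concurrency 8 can hold far more live tokens than short-short at concurrency 16, which is a plausible reason for its list stopping at a lower concurrency.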

Line-by-line Explanation

Key code blocks

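input_tokens: Sets the prompt length, which determines how much prefill work each request needs and how large its initial KV Cache is.
output_tokens: Sets the number of decode steps; each step rereads the request's KV Cache and appends one more token's entries to it.
concurrency: Sets how many requests are in flight at once, which determines batching opportunity and queueing pressure.
ttft_ms: Time to first token; strongly affected by queueing and prefill.
tpot_ms: Time per output token; reflects the cost of each decode step after the first token.
memory_peak_gb: Peak memory, covering weights plus KV Cache and runtime buffers.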
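
The two latency metrics above can be derived from three timestamps per request. The sketch below uses one common convention: TTFT is measured from send time to first token, and TPOT is the average gap between tokens after the first. Benchmark tools differ on the exact definition, so treat the RequestTrace type and both helper functions as hypothetical illustrations, not the lab's measurement code.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Client-side timestamps for one request (hypothetical record type)."""
    sent_s: float          # when the request was sent
    first_token_s: float   # when the first output token arrived
    last_token_s: float    # when the final output token arrived
    output_tokens: int     # number of output tokens received


def ttft_ms(trace: RequestTrace) -> float:
    """Time to first token: includes queueing delay and prefill."""
    return (trace.first_token_s - trace.sent_s) * 1000.0


def tpot_ms(trace: RequestTrace) -> float:
    """Average time per output token, excluding the first token."""
    if trace.output_tokens < 2:
        return 0.0  # no inter-token gaps to average
    decode_seconds = trace.last_token_s - trace.first_token_s
    return decode_seconds * 1000.0 / (trace.output_tokens - 1)


# Made-up timestamps for a 64-token response, for illustration only.
example = RequestTrace(sent_s=0.0, first_token_s=0.25,
                       last_token_s=1.234375, output_tokens=64)
print(ttft_ms(example), tpot_ms(example))
```

Excluding the first token from TPOT keeps prefill and queueing time out of the decode measurement, which matches the split between the two metrics described above.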

What to Notice

How to read this code

Serving metrics must be interpreted alongside the input and output length distribution of the workload that produced them.
Higher concurrency can improve throughput while hurting tail latency.
KV Cache capacity often determines how many long-context requests fit.
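
The KV Cache point can be made concrete with back-of-the-envelope arithmetic. For standard attention, each token stores one key and one value vector per layer per KV head. The sketch below applies that formula to the two workloads in the excerpt. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 16 GiB cache budget are assumptions chosen for illustration, not values from the lab, and the estimate ignores block granularity and other runtime overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_requests(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """How many full-length requests fit in the KV budget at the same time."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


# Assumed model shape and cache budget (hypothetical, for illustration).
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
KV_BUDGET = 16 * 2**30  # 16 GiB reserved for the KV cache

SHORT_FIT = max_resident_requests(KV_BUDGET, 128 + 64, PER_TOKEN)
LONG_FIT = max_resident_requests(KV_BUDGET, 4096 + 64, PER_TOKEN)
print(PER_TOKEN, SHORT_FIT, LONG_FIT)
```

Under these assumptions each token costs 128 KiB of cache, so several hundred short-short requests fit at once but only a few dozen long-short requests do.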

Common Misunderstandings

What this code does not mean

“tokens/s alone measures serving quality.” It misses TTFT, TPOT and tail latency.
“QPS is comparable across workloads.” QPS depends heavily on token lengths.
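
A small calculation shows why QPS does not transfer across workloads. The sketch below assumes, purely for illustration, that both workloads in the excerpt complete 100 requests in 50 seconds; these are made-up numbers, not measurements. The throughput function and its field names are hypothetical, and the config excerpt does not say whether tokens_per_second counts input tokens, so both variants are shown.

```python
def throughput(completed_requests, wall_seconds, input_tokens, output_tokens):
    """Convert a request rate into token rates for fixed-length requests."""
    qps = completed_requests / wall_seconds
    return {
        "qps": qps,
        "output_tokens_per_second": qps * output_tokens,
        "total_tokens_per_second": qps * (input_tokens + output_tokens),
    }


# Illustrative inputs: identical request counts and durations for both workloads.
SHORT = throughput(100, 50.0, input_tokens=128, output_tokens=64)
LONG = throughput(100, 50.0, input_tokens=4096, output_tokens=64)
print(SHORT)
print(LONG)
```

Both runs report 2.0 QPS and the same output token rate, yet the long-short run processes over twenty times as many total tokens per second, so equal QPS here hides very different amounts of work.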

Interview Explanation

How to say it out loud

For serving, I separate prefill and decode. Longer prompts raise TTFT through prefill, while longer outputs stress decode and KV Cache. I report TTFT, TPOT, QPS, tokens/s, and P95/P99 latency together because each captures a different part of the system.
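
The reason to report P95/P99 next to an average can be shown with a nearest-rank percentile over per-request TTFT samples. The sketch below is a minimal illustration; the percentile helper is hypothetical, it uses the nearest-rank definition (other tools interpolate and may give slightly different values), and the sample latencies are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile requires at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up TTFT samples in milliseconds: mostly fast, with two slow outliers.
TTFT_SAMPLES = [100.0] * 18 + [500.0, 2000.0]

MEAN = sum(TTFT_SAMPLES) / len(TTFT_SAMPLES)
P50 = percentile(TTFT_SAMPLES, 50)
P95 = percentile(TTFT_SAMPLES, 95)
P99 = percentile(TTFT_SAMPLES, 99)
print(MEAN, P50, P95, P99)
```

With these samples the median stays at the fast value while P95 and P99 land on the slow requests, which is the part of the distribution that a mean or median alone would hide.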

External intuition notes

Additional intuition

The PagedAttention paper is the fact base: the serving problem is often KV-cache memory management and batching, not only matrix multiplication. Paper: PagedAttention / vLLM
vLLM docs are the source for current configuration and feature behavior; avoid assuming support details without checking the current docs. Official: vLLM documentation
Practical vLLM explainers are useful for the analogy: KV Cache pages behave like a memory-management problem, while continuous batching is a scheduler problem. Blog: RunPod vLLM PagedAttention and continuous batching guide

InfraLens