# Lab 10: vLLM Serving Workload Config

This lab is config-reading material: the goal is to read the config, not to run
it. Running vLLM or another serving engine is optional and can wait for a later
pass; here the config serves as a vocabulary for reasoning about the shape of a
serving workload.

## Reading focus

- `input_tokens` maps to prefill work and initial KV cache creation.
- `output_tokens` maps to decode length and repeated KV cache reads.
- `concurrency` maps to batching opportunity, queueing pressure and KV cache capacity demand.
- `ttft_ms` (time to first token), `tpot_ms` (time per output token), `qps`, `tokens_per_second` and P95/P99 tail latency each answer a different serving question.
- `gpu_memory_utilization` is a capacity knob, not a correctness guarantee.
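
To make these mappings concrete, here is a minimal sketch that turns the three workload fields into rough prefill, decode and KV cache numbers. The `ModelShape` dimensions are hypothetical values chosen for illustration, not taken from `benchmark_config.yaml`, and the estimate ignores paging granularity, prefix sharing and any KV cache quantization.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    input_tokens: int   # prompt length per request
    output_tokens: int  # generated tokens per request
    concurrency: int    # requests in flight at once


@dataclass
class ModelShape:
    # Hypothetical model dimensions, for illustration only.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype_bytes: int = 2  # fp16 / bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.dtype_bytes


def workload_shape(w: Workload, m: ModelShape) -> dict:
    per_token = kv_bytes_per_token(m)
    # A request holds KV entries for its prompt plus everything generated so far.
    peak_tokens_per_request = w.input_tokens + w.output_tokens
    return {
        "prefill_tokens_per_request": w.input_tokens,
        # Approximate: the first output token is produced by the prefill pass.
        "decode_steps_per_request": w.output_tokens,
        "kv_bytes_per_token": per_token,
        "peak_kv_bytes": per_token * peak_tokens_per_request * w.concurrency,
    }


shape = workload_shape(Workload(input_tokens=2048, output_tokens=256, concurrency=16), ModelShape())
for name, value in shape.items():
    print(f"{name}: {value}")
```

Under these assumed dimensions, doubling `input_tokens` roughly doubles prefill work, doubling `output_tokens` roughly doubles the number of decode steps, and doubling `concurrency` doubles the peak KV footprint that has to fit inside whatever memory budget `gpu_memory_utilization` leaves for the cache.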

## Files

- `serving_benchmark_plan.md`: concepts and workload matrix for serving analysis.
- `benchmark_config.yaml`: placeholder config for reading model, workload and metric fields.

## Questions to answer while reading

- Which field makes prefill heavier?
- Which field makes decode longer?
- Why can QPS improve while P99 gets worse?
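
One way to explore the last question is a toy decode-only model. The sketch below assumes a made-up cost model in which each decode step has a fixed cost plus a small cost per sequence in the batch, and it assumes the batch stays full at the chosen concurrency. The `decode_step_ms` coefficients and the output-length mix are hypothetical, and prefill and queueing are ignored, so treat the numbers as directional only.

```python
def decode_step_ms(batch_size: int, base_ms: float = 20.0, per_seq_ms: float = 1.0) -> float:
    # Hypothetical cost model: fixed cost per step plus a cost per batched sequence.
    return base_ms + per_seq_ms * batch_size


def percentile(values, p: int) -> float:
    # Nearest-rank percentile for an integer p in 1..100.
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def simulate(concurrency: int, output_lengths) -> dict:
    step = decode_step_ms(concurrency)
    # Each request pays one decode step per output token.
    latencies_ms = [n * step for n in output_lengths]
    mean_latency_s = sum(latencies_ms) / len(latencies_ms) / 1000.0
    # Little's law: throughput = requests in flight / mean time in system.
    qps = concurrency / mean_latency_s
    return {"qps": qps, "p99_ms": percentile(latencies_ms, 99)}


# Hypothetical mix: mostly short answers with a few long ones.
OUTPUT_LENGTHS = [50] * 95 + [400] * 5

low = simulate(4, OUTPUT_LENGTHS)
high = simulate(32, OUTPUT_LENGTHS)
print(f"concurrency=4:  qps={low['qps']:.2f}  p99={low['p99_ms']:.0f} ms")
print(f"concurrency=32: qps={high['qps']:.2f}  p99={high['p99_ms']:.0f} ms")
```

In this toy model, higher concurrency amortizes the fixed per-step cost over more sequences, so QPS rises, but every step gets slower, and the long-output requests that define P99 pay that slower step hundreds of times.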
