# Serving Benchmark Plan

## Objective

Understand LLM serving behavior across prefill-heavy, decode-heavy, and mixed
workloads. This is primarily a reading plan: the workload matrix shows why
average tokens/s alone cannot describe serving behavior.

## Fields to identify before any future benchmark

- Model:
- Tokenizer:
- Engine:
- Precision / quantization:
- GPU:
- Sampling parameters:
- Warmup requests:
- Measurement window:
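
One way to keep these fields from being forgotten is to record them in a small structure and refuse to start a run while any are blank. The sketch below is only an illustration: the class name `BenchmarkConfig`, the attribute names, and the helper `missing_fields` are hypothetical and not tied to any serving engine.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkConfig:
    """Hypothetical record of the fields listed above; None means 'not yet identified'."""
    model: Optional[str] = None
    tokenizer: Optional[str] = None
    engine: Optional[str] = None
    precision: Optional[str] = None          # precision / quantization
    gpu: Optional[str] = None
    sampling_params: Optional[dict] = None
    warmup_requests: Optional[int] = None
    measurement_window_s: Optional[float] = None


def missing_fields(cfg: BenchmarkConfig) -> list:
    """Return the names of fields that are still unset, in declaration order."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


if __name__ == "__main__":
    cfg = BenchmarkConfig(model="example-model", gpu="example-gpu")
    print("still to identify:", missing_fields(cfg))
```

A run that starts with a non-empty `missing_fields` result is hard to reproduce later, which is why the check is worth doing before any measurement.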

## Workload matrix

| Case | Input length | Output length | Concurrency | Purpose |
| --- | ---: | ---: | ---: | --- |
| short-short | 128 | 64 | 1, 4, 8, 16 | baseline latency |
| long-short | 4096 | 64 | 1, 4, 8 | prefill pressure |
| short-long | 128 | 1024 | 1, 4, 8 | decode and KV pressure |
| mixed | distribution | distribution | 8, 16, 32 | scheduler behavior |
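
The matrix can be written down as data and expanded into individual runs. This is a minimal sketch: the names `WORKLOADS` and `expand_runs` are hypothetical, and the `mixed` case uses `None` for its lengths because the plan does not yet specify the distributions.

```python
# Workload matrix from the table above. Lengths are in tokens.
# None marks lengths drawn from a distribution that this plan leaves unspecified.
WORKLOADS = {
    "short-short": {"input_len": 128, "output_len": 64, "concurrency": [1, 4, 8, 16]},
    "long-short": {"input_len": 4096, "output_len": 64, "concurrency": [1, 4, 8]},
    "short-long": {"input_len": 128, "output_len": 1024, "concurrency": [1, 4, 8]},
    "mixed": {"input_len": None, "output_len": None, "concurrency": [8, 16, 32]},
}


def expand_runs(workloads):
    """Expand each case into one (case, input_len, output_len, concurrency) tuple per run."""
    runs = []
    for case, spec in workloads.items():
        for concurrency in spec["concurrency"]:
            runs.append((case, spec["input_len"], spec["output_len"], concurrency))
    return runs


if __name__ == "__main__":
    for run in expand_runs(WORKLOADS):
        print(run)
```

Expanding the table this way gives 13 runs in total, which is a useful sanity check on how long a full sweep would take.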

## Metrics

- TTFT: time to first token, sensitive to queueing and prefill.
- TPOT: time per output token, sensitive to decode and KV Cache reads.
- QPS: request throughput, only meaningful when reported alongside the
  input/output length distribution.
- tokens/s: token throughput; state whether prompt tokens are counted or only
  generated tokens.
- P50/P95/P99: latency percentiles; P50 is the median, while P95 and P99 capture
  tail latency and SLO risk.
- Memory peak: weights + KV Cache + runtime buffers.
- OOM boundary: max concurrency or context length before failure.
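
The latency metrics above can be derived from per-request timestamps. The following sketch assumes each request records the time it was sent and the arrival time of every output token; the function names are hypothetical, TPOT is taken here as the mean gap between consecutive output tokens after the first, and the percentile uses the nearest-rank method (other definitions interpolate and give slightly different values).

```python
import math


def ttft(request_sent, token_times):
    """Time to first token: first token arrival minus request send time (seconds)."""
    return token_times[0] - request_sent


def tpot(token_times):
    """Mean time per output token after the first; None if fewer than two tokens."""
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative timestamps only, not measured data.
    sent = 0.0
    tokens = [0.50, 0.60, 0.70, 0.80]
    print("TTFT:", ttft(sent, tokens))
    print("TPOT:", tpot(tokens))
    latencies = [0.2, 0.3, 0.25, 1.5, 0.22]
    print("P50:", percentile(latencies, 50), "P99:", percentile(latencies, 99))
```

Reporting TTFT and TPOT separately matters because a change that speeds up decode can leave TTFT untouched, and a single average tokens/s figure hides that split.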

## Analysis prompts

- Which workload is prefill-heavy?
- Which workload is decode-heavy?
- Does higher concurrency improve tokens/s while hurting P99?
- Does memory peak track KV Cache growth?
- What is the rollback condition for a serving optimization?
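
For the question about memory peak and KV Cache growth, a back-of-the-envelope estimate gives something to compare measurements against. The sketch below uses the standard sizing for a dense attention cache (one key and one value vector per layer, per KV head, per token). The function name and the model dimensions in the example are hypothetical, and real engines may allocate differently, for instance in fixed-size blocks or with a quantized cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to a 16-bit cache (an assumption).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    # short-long case at concurrency 8: 128 input + 1024 output tokens per request.
    total = kv_cache_bytes(32, 8, 128, seq_len=128 + 1024, batch_size=8)
    print(f"{total / 2**30:.3f} GiB")
```

If the measured memory peak grows much faster than this estimate as concurrency rises, the difference is likely in runtime buffers or allocator overhead rather than in the cache itself.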