# Serving Benchmark Template

## Model

- Model:
- Tokenizer:
- Precision / quantization:

## Engine

- Engine:
- Version:
- Tensor parallel size:
- Max model length:
- Other config:

## Input/output length distribution

| Workload | Input tokens | Output tokens | Notes |
| --- | ---: | ---: | --- |
| short-short | | | |
| long-short | | | |
| short-long | | | |
| mixed | | | |

## Concurrency

- Concurrency values:
- Arrival pattern:
- Warmup requests:
- Measurement requests:

## Metrics

| Workload | Concurrency | TTFT | TPOT | QPS | tokens/s | P50 | P95 | P99 | Memory peak |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| | | | | | | | | | |

## Analysis

- Prefill-heavy behavior:
- Decode-heavy behavior:
- KV Cache / memory behavior:
- Tail latency behavior:
- OOM or SLO boundary:
- Recommended setting:
- Rollback condition: