# Quantization Comparison Plan

## Objective

Read the Configurations table below as a concept map of quantization tradeoffs.
The goal is not to claim that one mode is always faster. It is to understand
what each mode saves, what it may damage, and why kernel support decides whether
a theoretical saving turns into actual serving speed.

## Configurations

| Config | Description | Expected benefit | Main risk |
| --- | --- | --- | --- |
| FP16/BF16 | baseline half precision | stable baseline and common kernel path | higher weight/KV memory |
| INT8 | lower-precision weights or activations where supported | lower bandwidth and memory | calibration quality, activation outliers, fallback kernels |
| INT4 weight-only | compressed weights while activations may remain higher precision | lower weight memory and potential bandwidth relief | quality loss, dequant overhead, kernel support |
| KV cache quantization | compressed K/V cache during serving | long-context and concurrency memory relief | decode quality, attention kernel support, implementation maturity |
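
The memory columns above can be made concrete with back-of-the-envelope arithmetic. The sketch below is a rough estimate only: the parameter count and attention dimensions are hypothetical placeholders rather than values from this plan, and it ignores quantization metadata such as scales and zero-points, activation memory, and runtime overhead.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at a given bit width."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bits_per_element: float) -> float:
    """Bytes needed for the K/V cache; the factor 2 covers keys and values."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element / 8


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
NUM_PARAMS = 7_000_000_000
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4 weight-only", 4)]:
    print(f"{name:18s} weights: {weight_bytes(NUM_PARAMS, bits) / 1e9:.1f} GB")

for name, bits in [("FP16/BF16 KV", 16), ("INT8 KV", 8)]:
    size = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM,
                          seq_len=4096, batch_size=1, bits_per_element=bits)
    print(f"{name:18s} cache:   {size / 2**30:.1f} GiB per 4096-token sequence")
```

Weight memory is fixed per model, while K/V cache memory grows with sequence length and batch size. That is why weight quantization mainly helps the model fit, and KV cache quantization mainly helps long contexts and concurrency.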

## Fixed variables

- Model:
- Tokenizer:
- Evaluation set:
- Prompt length distribution:
- Output length distribution:
- Sampling parameters:
- Engine:
- Hardware:

## Metrics

| Metric | Why |
| --- | --- |
| Quality score | captures accuracy/regression risk |
| Peak memory | shows capacity benefit |
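| TTFT / TPOT | time to first token and time per output token; separates prefill from decode behavior |
| tokens/s | end-to-end generation throughput |
| P95/P99 latency | tail latency risk that averages hide |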
| Kernel path | detects fallback or unsupported kernels |
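
The latency metrics in the table can be derived from per-token timestamps. The following is a minimal sketch under the assumption that the benchmark harness records a request start time and the arrival time of every output token. The function names are illustrative and do not belong to any particular engine, and the percentile uses the simple nearest-rank definition.

```python
import math


def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: dominated by queueing and prefill."""
    return token_times[0] - request_start


def tpot(token_times: list[float]) -> float:
    """Mean time per output token after the first: reflects decode speed."""
    if len(token_times) < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) over a non-empty sample."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative timestamps in seconds, not measurements.
start = 0.0
arrivals = [0.50, 0.60, 0.70, 0.80]
print(f"TTFT: {ttft(start, arrivals):.3f} s")
print(f"TPOT: {tpot(arrivals):.3f} s/token")
```

Reporting TTFT and TPOT separately matters here because a quantization mode can speed up bandwidth-bound decode while leaving compute-bound prefill unchanged, or even slowing it.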

## Reading prompts

- Which tensor is being quantized: weights, activations, or KV cache?
- Does the saving relieve a capacity limit, a bandwidth limit, or both?
- Does the runtime have a kernel that consumes the quantized format directly?
- If the model must dequantize before compute, where does that cost appear?
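
To make the dequantization question concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration of the idea only: real engines typically use per-channel or per-group scales and fused kernels, and the sample values are made up.

```python
def quantize_symmetric(values: list[float], num_bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers using one shared scale for the tensor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats; this is the extra work a kernel that
    cannot consume the quantized format directly has to do before compute."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27]          # illustrative values only
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f} (error {abs(original - approx):.5f})")
```

The integers are what is stored and moved through memory, so the saving comes from reading fewer bytes. If `dequantize` runs as a separate pass rather than inside the matmul or attention kernel, its cost shows up as extra memory traffic and latency that can cancel the saving.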

## Risk checklist

- Is the calibration set representative?
- Are activation outliers handled?
- Does the engine use optimized kernels for this quantization mode?
- Does quality regress on long outputs?
- Does the lower memory allow a larger batch or longer context?
- What is the rollback condition?
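
One way to make the rollback condition explicit is to write it down as a check that compares a quantized run against the baseline. The sketch below is hypothetical: the field names, the thresholds, and the sample numbers are placeholders to be replaced with values agreed for the actual experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    quality_score: float        # higher is better
    p99_latency_s: float
    peak_memory_gb: float
    used_fallback_kernel: bool


def rollback_reasons(baseline: RunResult, candidate: RunResult,
                     max_quality_drop: float = 0.01,
                     max_p99_ratio: float = 1.10) -> list[str]:
    """Return the reasons to roll back; an empty list means the run passes."""
    reasons = []
    if baseline.quality_score - candidate.quality_score > max_quality_drop:
        reasons.append("quality regression")
    if candidate.p99_latency_s > baseline.p99_latency_s * max_p99_ratio:
        reasons.append("P99 latency regression")
    if candidate.used_fallback_kernel:
        reasons.append("fallback kernel detected")
    if candidate.peak_memory_gb >= baseline.peak_memory_gb:
        reasons.append("no memory saving")
    return reasons


# Made-up numbers for illustration, not measurements.
baseline = RunResult(quality_score=0.80, p99_latency_s=1.00,
                     peak_memory_gb=16.0, used_fallback_kernel=False)
candidate = RunResult(quality_score=0.78, p99_latency_s=1.05,
                      peak_memory_gb=9.0, used_fallback_kernel=False)
print(rollback_reasons(baseline, candidate))
```

Fixing the thresholds before running the comparison keeps the decision from being adjusted after the results are known.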