Lab 11: Quantization Comparison
Annotated code reading lab. Running code is optional.
Inference Serving / Tradeoffs
Inference serving breaks down into distinct concerns: prefill, decode, batching, KV Cache capacity, and tail latency. A useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.
Read code to understand the concept
Core mechanism
- Quantization can reduce memory footprint and bandwidth, but real speedups depend on what is quantized, calibration or outlier handling, kernel support, and the quality risk of the target workload.
- Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
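The bandwidth-bound versus compute-bound question in the last bullet can be sketched with a roofline-style check. This is a minimal illustration, not a profiler replacement: the function names are hypothetical, the hardware peaks (300 TFLOP/s, 2 TB/s) are made-up round numbers, and the traffic model counts only weight bytes while ignoring activations and caches.

```python
# Roofline-style sketch: compare arithmetic intensity to the hardware ridge point.
# All names and hardware numbers here are illustrative assumptions.

PEAK_FLOPS = 300e12      # hypothetical peak compute, FLOP/s
PEAK_BANDWIDTH = 2e12    # hypothetical peak memory bandwidth, bytes/s


def matmul_intensity(d_in: int, d_out: int, batch: int, bytes_per_weight: float) -> float:
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) matmul.

    Simplification: only weight traffic is counted, so this is an upper
    bound on the true intensity of the layer.
    """
    flops = 2 * d_in * d_out * batch
    bytes_moved = d_in * d_out * bytes_per_weight
    return flops / bytes_moved


def classify(intensity: float, peak_flops: float = PEAK_FLOPS,
             peak_bandwidth: float = PEAK_BANDWIDTH) -> str:
    """Below the ridge point the kernel waits on memory; above it, on math."""
    ridge = peak_flops / peak_bandwidth
    return "bandwidth-bound" if intensity < ridge else "compute-bound"


# Decode with batch size 1 in FP16 (2 bytes per weight).
decode = matmul_intensity(4096, 4096, batch=1, bytes_per_weight=2)
# A large batch in the same layer, as in prefill or heavy batching.
batched = matmul_intensity(4096, 4096, batch=512, bytes_per_weight=2)

print(decode, classify(decode))
print(batched, classify(batched))
```

Under these assumptions, small-batch decode sits far below the ridge point, which is why reducing bytes per weight can help there, while a large batch is limited by math throughput instead. Real classification should come from profiler counters.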
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-11-quantization/quantization_comparison_plan.md. This preview explains the key idea; the linked starter file is the source of truth.
| Config | Description | Expected benefit | Main risk |
| --- | --- | --- | --- |
| FP16/BF16 | baseline half precision | stable baseline | higher memory |
| INT8 | lower-precision weights/activations where supported | lower bandwidth and memory | calibration and outliers |
| INT4 weight-only | compressed weights | lower weight memory | quality and kernel support |
| KV Cache quantization | compressed K/V cache | long-context memory relief | decode quality and kernel support |
Key code blocks
- FP16/BF16: Baseline for quality and kernel support.
- INT8: Often needs calibration and careful handling of activation ranges.
- INT4 weight-only: Saves weight memory, but speed depends on dequantization and optimized kernels.
- KV Cache quantization: Targets serving capacity and long-context memory, not training optimizer states.
- Kernel support: Without optimized kernels, lower precision can fall back to slower paths.
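To make the INT8 point about activation ranges concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only and not the scheme used by any particular library; the helper names and the sample values are assumptions chosen to show how a single outlier inflates the scale and degrades the typical values.

```python
# Symmetric per-tensor INT8 sketch: one scale for the whole tensor.
# Helper names and data are hypothetical, chosen for illustration.

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# "Typical" values in [-0.5, 0.5], then the same values plus one large outlier.
typical = [0.01 * i for i in range(-50, 51)]
with_outlier = typical + [50.0]

codes, scale = quantize_int8(typical)
err_clean = mean_abs_error(typical, dequantize(codes, scale))

codes_o, scale_o = quantize_int8(with_outlier)
restored_o = dequantize(codes_o, scale_o)
# Measure error only on the typical values so the two cases are comparable.
err_outlier = mean_abs_error(typical, restored_o[:len(typical)])

print(f"scale without outlier: {scale:.5f}, error: {err_clean:.5f}")
print(f"scale with outlier:    {scale_o:.5f}, error: {err_outlier:.5f}")
```

The outlier stretches the scale so that most typical values collapse onto a few integer codes. Calibration data and outlier handling (for example per-channel or per-group scales) exist to limit this effect.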
How to read this code
- Always ask what is quantized: weights, activations, or KV Cache.
- Quality and performance must be considered together.
- Lower bit-width reduces storage, but it does not automatically reduce end-to-end latency.
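The storage side of these reading notes is simple arithmetic, sketched below. The 7-billion-parameter model, the group size of 128, and the 16-bit scales are hypothetical choices for illustration; zero points and any runtime padding are ignored. The numbers say nothing about latency, which still depends on the kernels that consume the compressed weights.

```python
# Weight storage estimate per configuration.
# Model size, group size, and scale width are illustrative assumptions.

def weight_bytes(num_params: int, bits: int, group_size: int = 0,
                 scale_bits: int = 16) -> int:
    """Bytes needed for the weights, plus one scale per group if grouped."""
    payload = num_params * bits // 8
    if group_size:
        num_groups = -(-num_params // group_size)  # ceiling division
        payload += num_groups * scale_bits // 8
    return payload


NUM_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

configs = {
    "FP16/BF16": weight_bytes(NUM_PARAMS, 16),
    "INT8": weight_bytes(NUM_PARAMS, 8),
    "INT4 weight-only (group 128)": weight_bytes(NUM_PARAMS, 4, group_size=128),
}

for name, size in configs.items():
    print(f"{name:30s} {size / 1e9:.2f} GB")
```

The grouped INT4 figure is slightly above a quarter of the FP16 size because the per-group scales are stored alongside the 4-bit codes.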
What this code does not mean
- “INT4 is always faster than FP16.” Kernel support and workload shape decide.
- “Quantization is only an accuracy topic.” It is also a memory bandwidth and serving capacity topic.
How to say it out loud
Quantization trades numerical precision and implementation complexity for lower memory and bandwidth. Weight-only INT4 can reduce weight memory, INT8 may require calibration, and KV Cache quantization helps long-context serving. I would check quality, kernel support, latency and memory together.
Additional intuition
- Hugging Face quantization docs are the safest place to check which methods are supported in the current Transformers stack. Official: Hugging Face quantization overview
- AWQ and GPTQ papers should be treated as algorithm references; deployment speed still depends on runtime and kernel support. Paper: AWQ
- The Hugging Face GPTQ blog is useful for intuition because it emphasizes calibration data, which is where many real quantization failures start. Blog: Hugging Face AutoGPTQ integration
- vLLM KV-cache quantization docs are a reminder that quantizing KV Cache targets serving capacity and context length pressure, not model-weight storage. Official: vLLM quantized KV Cache
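The capacity argument for KV Cache quantization can be checked with a back-of-the-envelope estimate. The sketch below assumes a hypothetical model shape (32 layers, 32 KV heads, head dimension 128) and ignores quantization scale metadata and allocator or paging overhead, so real serving numbers will differ.

```python
# KV Cache size estimate: one K and one V vector per layer, head, and token.
# The model shape below is a hypothetical example, not a specific model.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bits_per_element: int) -> int:
    """Bytes held by the KV Cache for the given shape and element width."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element // 8


shape = dict(num_layers=32, num_kv_heads=32, head_dim=128,
             seq_len=4096, batch_size=1)

fp16_cache = kv_cache_bytes(**shape, bits_per_element=16)
int8_cache = kv_cache_bytes(**shape, bits_per_element=8)

print(f"16-bit KV Cache: {fp16_cache / 1024**3:.2f} GiB per sequence")
print(f" 8-bit KV Cache: {int8_cache / 1024**3:.2f} GiB per sequence")
```

Because the cache grows linearly with sequence length and batch size, halving the element width roughly doubles how many tokens fit in the same memory budget, independent of how the model weights are stored.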
