Lab 11 - Quantization Comparison

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Inference Serving / Tradeoffs

Inference serving is easiest to reason about as separate concerns: prefill, decode, batching, KV Cache capacity, and tail latency. A useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.

Concept Goal

Read code to understand the concept

Mental Model

Core mechanism

Quantization can reduce memory footprint and bandwidth, but real speedups depend on what is quantized, calibration or outlier handling, kernel support, and the quality risk of the target workload.
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
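The bandwidth-bound versus compute-bound question can be sketched with a rough lower-bound estimate for one decode step. This is a simplified model, not a profiler replacement: it assumes every weight is read once per step, counts about 2 FLOPs per weight per token, and ignores KV Cache reads, dequantization cost, and kernel efficiency. The model size, bandwidth, and peak FLOP/s values are hypothetical.

```python
def decode_step_bounds(n_params, bytes_per_weight, batch_size,
                       mem_bandwidth_bytes_s, peak_flops_s):
    """Return (memory_time_s, compute_time_s) lower bounds for one decode step."""
    # Weights are streamed from memory once per step, independent of batch size.
    memory_time = n_params * bytes_per_weight / mem_bandwidth_bytes_s
    # About 2 FLOPs (multiply + add) per weight for each token in the batch.
    compute_time = 2 * n_params * batch_size / peak_flops_s
    return memory_time, compute_time


def bottleneck(memory_time, compute_time):
    """Name whichever bound dominates the step."""
    return "bandwidth-bound" if memory_time >= compute_time else "compute-bound"


# Hypothetical hardware and model numbers, for illustration only.
N_PARAMS = 7e9
BANDWIDTH = 1.0e12   # bytes per second
PEAK_FLOPS = 200e12  # FLOP per second

for label, bytes_per_weight in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    for batch in (1, 256):
        mem_t, cmp_t = decode_step_bounds(
            N_PARAMS, bytes_per_weight, batch, BANDWIDTH, PEAK_FLOPS)
        print(f"{label:5s} batch={batch:<4d} memory={mem_t * 1e3:7.3f} ms "
              f"compute={cmp_t * 1e3:7.3f} ms -> {bottleneck(mem_t, cmp_t)}")
```

Under these assumed numbers, every format is bandwidth-bound at batch size 1, so smaller weights shorten the memory bound. At batch size 256 the compute bound dominates for all three formats, and shrinking the weights no longer moves the estimate.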

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README, quantization_comparison_plan.md, optional note template

Annotated Code Preview

Starter Preview

Excerpt from code/lab-11-quantization/quantization_comparison_plan.md. This preview explains the key idea; the linked starter file is the source of truth.

| Config | Description | Expected benefit | Main risk |
| --- | --- | --- | --- |
| FP16/BF16 | baseline half precision | stable baseline | higher memory |
| INT8 | lower-precision weights/activations where supported | lower bandwidth and memory | calibration and outliers |
| INT4 weight-only | compressed weights | lower weight memory | quality and kernel support |
| KV Cache quantization | compressed K/V cache | long-context memory relief | decode quality and kernel support |
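To put rough numbers on the "lower weight memory" column, the sketch below estimates weight storage for the first three rows. It assumes a hypothetical 7-billion-parameter model and, for INT4, one 16-bit scale per group of 128 weights. Real formats may also store zero points or other metadata, so treat the output as an approximation.

```python
def weight_bytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in bytes.

    If group_size is given, add one scale of scale_bits per group
    (an assumed layout; real formats differ in their metadata).
    """
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


N_PARAMS = 7_000_000_000  # hypothetical model size
GIB = 1024 ** 3

configs = {
    "FP16/BF16": weight_bytes(N_PARAMS, 16),
    "INT8": weight_bytes(N_PARAMS, 8),
    "INT4 weight-only (group 128)": weight_bytes(N_PARAMS, 4, group_size=128),
}

for name, size in configs.items():
    print(f"{name:30s} {size / GIB:6.2f} GiB")
```

The group scales are why INT4 storage is slightly more than a quarter of the FP16 figure.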

Line-by-line Explanation

Key code blocks

FP16/BF16: Baseline for quality and kernel support.
INT8: Often needs calibration and careful handling of activation ranges.
INT4 weight-only: Saves weight memory, but speed depends on dequantization and optimized kernels.
KV Cache quantization: Targets serving capacity and long-context memory, not training optimizer states.
Kernel support: Without optimized low-precision kernels, the runtime can fall back to slower paths, which can erase the expected speedup.
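The INT8 note about calibration and activation ranges can be illustrated with a small standard-library sketch. It compares two ways of choosing a symmetric per-tensor scale on synthetic data with a single injected outlier: the absolute maximum, and a high percentile that clips the outlier. The data, the percentile value, and all function names are assumptions for illustration, not the lab's own method.

```python
import random


def quantize_int8(values, scale):
    """Symmetric INT8: divide by scale, round, clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


def absmax_scale(values):
    """Scale that maps the largest magnitude onto 127."""
    return max(abs(v) for v in values) / 127


def percentile_scale(values, pct=0.999):
    """Scale from a high percentile of |x|; larger values get clipped."""
    ordered = sorted(abs(v) for v in values)
    idx = min(len(ordered) - 1, int(pct * len(ordered)))
    return ordered[idx] / 127


def errors(values, scale):
    """Return (mean abs error on values[1:], abs error on values[0])."""
    restored = dequantize(quantize_int8(values, scale), scale)
    typical = sum(abs(a - b) for a, b in zip(values[1:], restored[1:]))
    return typical / (len(values) - 1), abs(values[0] - restored[0])


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations[0] = 80.0  # synthetic outlier placed at index 0

results = {
    "absmax": errors(activations, absmax_scale(activations)),
    "percentile": errors(activations, percentile_scale(activations)),
}
for name, (typical, outlier) in results.items():
    print(f"{name:10s} typical error={typical:.4f} outlier error={outlier:.2f}")
```

The absmax scale preserves the outlier but spends most of the INT8 range on it, so typical values lose resolution. The percentile scale does the opposite. Which trade is acceptable depends on how much the workload relies on those outliers.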

What to Notice

How to read this code

Always ask what is quantized: weights, activations, or KV Cache.
Quality and performance must be considered together.
Lower bit-width reduces storage; it does not automatically reduce end-to-end latency.
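A small weight-only INT4 sketch shows why storage and latency are separate questions. Two 4-bit values fit in one byte, but each value must be unpacked and rescaled before it can be used in arithmetic. The symmetric per-group scheme and the low-nibble-first packing order are assumptions chosen for simplicity. Optimized kernels typically fuse this work into the matmul rather than running it as a separate step.

```python
def quantize_group_int4(group):
    """Symmetric 4-bit quantization of one group: ints in [-8, 7] plus a scale."""
    absmax = max(abs(v) for v in group)
    scale = absmax / 7 if absmax > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in group]
    return q, scale


def pack_int4(q_values):
    """Pack pairs of signed 4-bit ints into bytes, low nibble first."""
    assert len(q_values) % 2 == 0, "need an even number of values"
    packed = bytearray()
    for lo, hi in zip(q_values[0::2], q_values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Reverse pack_int4, restoring the sign of each nibble."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dequantize_group(packed, scale):
    """The extra work a kernel must do before the weights are usable."""
    return [q * scale for q in unpack_int4(packed)]


weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.76, 0.02]
q, scale = quantize_group_int4(weights)
packed = pack_int4(q)
restored = dequantize_group(packed, scale)

print(f"{len(weights)} weights stored in {len(packed)} bytes plus one scale")
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```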

Common Misunderstandings

What this code does not mean

“INT4 is always faster than FP16.” Kernel support and workload shape decide.
“Quantization is only an accuracy topic.” It is also a memory bandwidth and serving capacity topic.

Interview Explanation

How to say it out loud

Quantization trades numerical precision and implementation complexity for lower memory and bandwidth. Weight-only INT4 can reduce weight memory, INT8 may require calibration, and KV Cache quantization helps long-context serving. I would check quality, kernel support, latency, and memory together.

External intuition notes

Additional intuition

Hugging Face quantization docs are the safest place to check which methods are supported in the current Transformers stack. Official: Hugging Face quantization overview
AWQ and GPTQ papers should be treated as algorithm references; deployment speed still depends on runtime and kernel support. Paper: AWQ
The Hugging Face GPTQ blog is useful for intuition because it emphasizes calibration data, which is where many real quantization failures start. Blog: Hugging Face AutoGPTQ integration
vLLM KV-cache quantization docs are a reminder that quantizing KV Cache targets serving capacity and context length pressure, not model-weight storage. Official: vLLM quantized KV Cache
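The point about serving capacity can be made concrete with a back-of-the-envelope estimate of how many tokens of KV Cache fit in the memory left after the weights are loaded. The layer count, KV head count, head dimension, GPU size, and weight footprint below are hypothetical. The estimate ignores scale metadata, activation memory, and allocator fragmentation.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV Cache for one token; the factor 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(gpu_bytes, weight_bytes, per_token_bytes):
    """Tokens that fit in the memory left over after the weights."""
    free = gpu_bytes - weight_bytes
    return max(0, free // per_token_bytes)


GIB = 1024 ** 3
GPU_BYTES = 24 * GIB     # hypothetical accelerator memory
WEIGHT_BYTES = 14 * GIB  # hypothetical weight footprint

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token_16bit = kv_bytes_per_token(32, 8, 128, 2)
per_token_8bit = kv_bytes_per_token(32, 8, 128, 1)

for label, per_token in [("16-bit KV", per_token_16bit),
                         ("8-bit KV", per_token_8bit)]:
    tokens = max_cached_tokens(GPU_BYTES, WEIGHT_BYTES, per_token)
    print(f"{label:10s} {per_token // 1024:4d} KiB/token -> {tokens} tokens")
```

With these assumed numbers, halving the KV element size doubles the token budget from 81920 to 163840 while leaving the weight footprint unchanged.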

InfraLens