InfraLens

A clear starting point for learning AI infrastructure.

Lab 11: Quantization Comparison

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

Inference serving is easiest to reason about when prefill, decode, batching, KV Cache capacity, and tail latency are treated as separate concerns. A useful explanation names the bottleneck first, then connects it to memory bandwidth, scheduling, and measured serving metrics.

Mental Model

Core mechanism

  • Quantization can reduce memory footprint and memory bandwidth demand, but real speedups depend on what is quantized, how calibration and outliers are handled, whether optimized kernels exist, and how much quality loss the target workload can tolerate.
  • Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
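
The storage side of the points above can be made concrete with a little arithmetic. The sketch below is illustrative only: the parameter count and memory bandwidth are assumed values, not figures from the starter file, and the time estimate assumes a batch-size-1 decode step that reads every weight once and is limited purely by memory bandwidth.

```python
# Hypothetical numbers for illustration only; substitute measured values.
PARAMS = 7_000_000_000           # assumed parameter count
BANDWIDTH_BYTES_PER_S = 1.0e12   # assumed memory bandwidth (1 TB/s)

BITS_PER_WEIGHT = {"FP16/BF16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(params: int, bits: int) -> float:
    """Storage for the weights alone, ignoring scales and zero points."""
    return params * bits / 8


def min_decode_ms_per_token(params: int, bits: int, bandwidth: float) -> float:
    """Lower bound on one decode step, assuming every weight is read once
    and nothing but memory bandwidth limits the step."""
    return weight_bytes(params, bits) / bandwidth * 1e3


for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_bytes(PARAMS, bits) / 1e9
    ms = min_decode_ms_per_token(PARAMS, bits, BANDWIDTH_BYTES_PER_S)
    print(f"{name:10s} {gb:5.1f} GB   >= {ms:4.1f} ms/token")
```

These are lower bounds under a bandwidth-only model. A measured run can be slower for many reasons, including dequantization overhead or a kernel that falls back to a slower path.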

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-11-quantization/quantization_comparison_plan.md. This preview explains the key idea; the linked starter file is the source of truth.

| Config | Description | Expected benefit | Main risk |
| --- | --- | --- | --- |
| FP16/BF16 | baseline half precision | stable baseline | higher memory |
| INT8 | lower-precision weights/activations where supported | lower bandwidth and memory | calibration and outliers |
| INT4 weight-only | compressed weights | lower weight memory | quality and kernel support |
| KV Cache quantization | compressed K/V cache | long-context memory relief | decode quality and kernel support |
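
To make the "INT4 weight-only" row more tangible, here is a minimal pure-Python sketch of group-wise symmetric 4-bit quantization. It is not taken from the starter file: the function names, the group size of 64, the symmetric range [-7, 7], and the 16-bit scale are all assumptions chosen for illustration. It shows two things the table hints at: weights must be dequantized before use, and per-group scales add storage on top of the nominal 4 bits.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    qmax = 7  # symmetric signed range used in this sketch: [-7, 7]
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid a zero scale
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Recover approximate weights; a real kernel does this before the matmul."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def effective_bits(group_size, weight_bits=4, scale_bits=16):
    """Bits per weight once the per-group scale is counted."""
    return weight_bits + scale_bits / group_size


# Deterministic toy weights in [-0.5, 0.5].
weights = [((i * 37) % 101 - 50) / 100 for i in range(256)]
codes, scales = quantize_groupwise_int4(weights, group_size=64)
restored = dequantize_groupwise(codes, scales, group_size=64)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error {max_err:.4f}, effective bits {effective_bits(64):.2f}")
```

Smaller groups track the local weight range more closely but raise the effective bits per weight, which is one reason the quality and memory columns of the table have to be read together.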

Line-by-line Explanation

Key code blocks

  • FP16/BF16: Baseline for quality and kernel support.
  • INT8: Often needs calibration and careful handling of activation ranges.
  • INT4 weight-only: Saves weight memory, but speed depends on dequantization and optimized kernels.
  • KV Cache quantization: Targets serving capacity and long-context memory, not training optimizer states.
  • Kernel support: Without optimized kernels, lower precision can fall back to slower paths.
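
The INT8 note about activation ranges can be illustrated with a small experiment. The sketch below uses per-tensor symmetric quantization with an abs-max scale, which is an assumed, deliberately simple calibration scheme and not necessarily what any particular runtime does. The data is synthetic, and the function names are hypothetical.

```python
def quantize_symmetric(values, bits=8):
    """Per-tensor symmetric quantization with an abs-max scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def mean_abs_error(values, bits=8):
    """Average round-trip error after quantizing and dequantizing."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Synthetic activations in [-0.5, 0.5], then the same data plus one outlier.
activations = [0.01 * i for i in range(-50, 51)]
with_outlier = activations + [60.0]

err_clean = mean_abs_error(activations)
err_outlier = mean_abs_error(with_outlier)
print(f"clean: {err_clean:.5f}   with outlier: {err_outlier:.5f}")
```

A single large value stretches the scale, so most of the small values collapse onto a handful of integer codes. That is the failure mode that calibration data and outlier handling are meant to catch.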

What to Notice

How to read this code

  • Always ask what is quantized: weights, activations or KV Cache.
  • Quality and performance must be considered together.
  • Lower bit-width reduces storage, not automatically end-to-end latency.

Common Misunderstandings

What this code does not mean

  • “INT4 is always faster than FP16.” Kernel support and workload shape decide.
  • “Quantization is only an accuracy topic.” It is also a memory bandwidth and serving capacity topic.
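
One way to see why INT4 is not automatically faster is a roofline-style estimate for a single linear layer. This is a simplified sketch: the peak throughput and bandwidth are assumed values for a hypothetical accelerator, math is assumed to run in FP16 regardless of weight storage, and dequantization cost is ignored, which favours the low-bit case.

```python
# Hypothetical accelerator; replace with datasheet or profiler numbers.
PEAK_FLOPS = 300e12       # assumed dense FP16 throughput, FLOP/s
PEAK_BANDWIDTH = 1.0e12   # assumed memory bandwidth, bytes/s


def linear_layer_time(batch_tokens, d_in, d_out, weight_bits):
    """Roofline estimate for y = x @ W with FP16 activations.

    Returns (seconds, bound), where bound is 'bandwidth' or 'compute'.
    """
    flops = 2 * batch_tokens * d_in * d_out
    weight_bytes = d_in * d_out * weight_bits / 8
    activation_bytes = 2 * batch_tokens * (d_in + d_out)  # FP16 input and output
    t_compute = flops / PEAK_FLOPS
    t_memory = (weight_bytes + activation_bytes) / PEAK_BANDWIDTH
    if t_memory >= t_compute:
        return t_memory, "bandwidth"
    return t_compute, "compute"


for batch_tokens in (1, 64, 4096):
    t16, bound16 = linear_layer_time(batch_tokens, 4096, 4096, 16)
    t4, bound4 = linear_layer_time(batch_tokens, 4096, 4096, 4)
    print(f"tokens={batch_tokens:5d}  FP16 {bound16:9s}  INT4 {bound4:9s}  "
          f"estimated speedup {t16 / t4:.2f}x")
```

Under these assumptions, a single-token decode step is bandwidth-bound and benefits from smaller weights, while a large prefill batch is compute-bound and the estimated speedup disappears. Real kernels add dequantization work, so measured results can be worse than this estimate.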

Interview Explanation

How to say it out loud

Quantization trades numerical precision and implementation complexity for lower memory and bandwidth. Weight-only INT4 can reduce weight memory, INT8 may require calibration, and KV Cache quantization helps long-context serving. I would check quality, kernel support, latency and memory together.

External intuition notes

Additional intuition

  • Hugging Face quantization docs are the safest place to check which methods are supported in the current Transformers stack. Official: Hugging Face quantization overview
  • AWQ and GPTQ papers should be treated as algorithm references; deployment speed still depends on runtime and kernel support. Paper: AWQ
  • The Hugging Face GPTQ blog is useful for intuition because it emphasizes calibration data, which is where many real quantization failures start. Blog: Hugging Face AutoGPTQ integration
  • vLLM KV-cache quantization docs are a reminder that quantizing KV Cache targets serving capacity and context length pressure, not model-weight storage. Official: vLLM quantized KV Cache
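
The capacity argument for KV Cache quantization follows from how the cache scales with context length and batch size. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) chosen only to make the arithmetic concrete, and it ignores any scale factors a quantized cache may store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value):
    """Bytes for keys and values across all layers (the factor 2 is K plus V)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, not taken from any specific model card.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for label, bytes_per_value in (("FP16", 2), ("8-bit", 1)):
    total = kv_cache_bytes(seq_len=32_768, batch_size=8,
                           bytes_per_value=bytes_per_value, **shape)
    print(f"{label:5s} KV Cache: {total / 2**30:.1f} GiB")
```

Halving the bytes per cached value halves this figure, which translates into either longer contexts or more concurrent sequences on the same device. Weight storage is unaffected.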

Further Reading

Official, paper and practical references