Overview

Model Serving Interview Practice

Practice explaining how model serving systems work, using the questions on this page.

Part of InfraLens Interview Practice

Use questions to train explanation, not memorization

This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.

Reading method

Try to answer each question out loud first. Then open the answer and check whether you covered the mechanism, why it matters, the tradeoffs, the common mistakes, and the related handbook or lab.

Q&A

Q&A Cards

Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.

01. What does a model serving system need beyond a prediction function?

Short Answer

It needs request handling, batching, resource isolation, versioning, rollout control, monitoring, autoscaling and failure handling. The model function is only the core computation.

Deeper Explanation

Serving turns a model artifact into a production API. The system must handle traffic variability, input validation, latency targets, model versions and observability. For LLMs, scheduling and KV cache management become central.

Common Mistake

Equating serving with loading a checkpoint in a web server.

Source / Inspiration: LLM module quiz inspiration · MLOps quiz inspiration

02. Why does batching improve inference throughput?

Short Answer

Batching lets the accelerator process multiple requests together, improving hardware utilization. The tradeoff is that waiting to form a batch can increase latency.

Deeper Explanation

For fixed-shape models, batching is relatively straightforward. For LLMs, requests have variable prompt and output lengths, so continuous batching can add or remove sequences between decode steps. The scheduler must balance throughput and tail latency.
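
The throughput and latency tradeoff can be made concrete with a toy cost model. The sketch below assumes one batch costs a fixed overhead plus a per-request cost and that requests arrive at a steady rate; the function name and all numbers are illustrative assumptions, not measurements.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0, arrival_rate_per_s=100.0):
    """Toy model: return (requests per second, worst-case latency in ms)."""
    # Assumed cost model: fixed launch overhead plus linear per-request work.
    step_ms = fixed_ms + per_item_ms * batch_size
    throughput = batch_size / step_ms * 1000.0
    # The first request in a batch waits for the other batch_size - 1 arrivals.
    fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000.0
    return throughput, fill_wait_ms + step_ms


for size in (1, 4, 8, 16):
    rps, latency_ms = batch_stats(size)
    print(f"batch={size:2d} throughput={rps:6.1f} req/s latency={latency_ms:6.1f} ms")
```

In this model both numbers rise with batch size, which is the tradeoff the scheduler has to balance.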

Common Mistake

Thinking bigger batches always improve user experience. Large batches can increase per-request latency and memory use.

Source / Inspiration: LLM quiz inspiration · vLLM docs

03. What is continuous batching in LLM serving?

Short Answer

Continuous batching schedules active generation requests together at token steps and admits new requests as others finish. It avoids waiting for an entire static batch to complete.

Deeper Explanation

LLM generation has a long decode phase where each request may finish at a different time. Continuous batching keeps the GPU busier, but the scheduler must manage KV cache, fairness, maximum sequence length and latency targets.
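
A small simulation can show why dynamic membership helps. This is a sketch under simplifying assumptions: every decode step costs the same, each request needs at least one output token, and KV cache limits are ignored. The function names are hypothetical and do not correspond to any serving engine's API.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: refill free slots between decode steps."""
    waiting = deque(lengths)
    active = []  # remaining output tokens per active sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Each active sequence emits one token; finished sequences leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [8, 2, 2, 2]
print(static_batch_steps(output_lengths, batch_size=2))      # 10
print(continuous_batch_steps(output_lengths, batch_size=2))  # 8
```

The short requests ride alongside the long one instead of waiting behind it, so the same work finishes in fewer decode steps.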

Common Mistake

Describing it as ordinary request batching. The key distinction is dynamic membership across decode steps.

Source / Inspiration: vLLM docs · LLM quiz inspiration

04. What is KV cache, and why does it matter for serving?

Short Answer

KV cache stores attention keys and values from previous tokens so decoding does not recompute the whole prefix every step. It speeds generation, but its memory footprint grows with batch size, sequence length, number of layers, number of KV heads and head dimension.

Deeper Explanation

Serving systems must budget KV memory alongside model weights. Long prompts and many concurrent users can exhaust memory even when the model itself fits. Cache eviction, paging and request admission are therefore infrastructure concerns.
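
A back-of-the-envelope estimate makes the budget concrete. The sketch below uses the standard accounting of one key tensor and one value tensor per layer; the model shape in the example is an illustrative assumption, not a figure from this page.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_elem,
                   seq_len, batch_size):
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * batch_size


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, 16-bit cache entries.
per_token = kv_bytes_per_token(32, 32, 128, 2)
per_seq_gib = kv_cache_bytes(32, 32, 128, 2, seq_len=4096, batch_size=1) / 1024**3
print(per_token, per_seq_gib)  # 524288 2.0
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache, so a handful of long concurrent requests can rival the weights themselves.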

Common Mistake

Only counting model weights when sizing an LLM server.

Source / Inspiration: Hugging Face KV cache docs · LLM quiz inspiration

05. What does PagedAttention change?

Short Answer

PagedAttention manages KV cache in blocks, similar to virtual memory pages. This reduces waste from variable-length sequences and helps serve more concurrent requests.

Deeper Explanation

The important idea is memory management, not a new model architecture. By decoupling logical sequence cache from contiguous physical allocation, serving systems can reduce fragmentation and improve batching flexibility.
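
The block-table idea can be sketched with a toy allocator. This is an illustration of the memory-management concept only, not vLLM's implementation; the class name, method names and the block size of 16 tokens are assumptions.

```python
class BlockAllocator:
    """Toy KV block table: logical sequences map to non-contiguous blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token; return False if out of blocks."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free:
                return False
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("seq-a")
print(len(allocator.tables["seq-a"]), len(allocator.free))  # 2 2
```

A sequence only holds the blocks it has actually filled, so waste is bounded by one partly used block per sequence rather than a worst-case contiguous reservation.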

Common Mistake

Calling PagedAttention a different attention formula. It is a KV cache management technique for serving.

Source / Inspiration: PagedAttention paper · vLLM docs

06. How do TTFT and TPOT differ?

Short Answer

TTFT measures how long the user waits for the first generated token. TPOT measures the pace of subsequent tokens after generation starts. Track them at P95/P99, not only average latency.

Metric | Measures | What it reveals
TTFT | Time to first token. | Queueing and prefill latency.
TPOT | Time per output token after the first. | Decode and cache-read behavior.
QPS | Requests completed per second. | Admission capacity for the workload mix.
tokens/s | Generated or processed tokens per second. | Throughput under stated lengths and batching.
P95/P99 | Tail latency percentiles. | Queue spikes and unfair scheduling.

Deeper Explanation

TTFT is dominated by queueing, scheduling and prompt prefill. TPOT is dominated by decode efficiency, batching and cache behavior. Optimizing one can hurt the other, so LLM serving dashboards need both.
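
Both metrics can be derived from per-token timestamps. The sketch below assumes the server records the request arrival time and the emission time of each output token in seconds; the function name is hypothetical.

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds; TPOT is None with fewer than two tokens."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


ttft, tpot = ttft_and_tpot(0.0, [0.50, 0.55, 0.60, 0.65])
print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s")  # TTFT=0.50s TPOT=0.050s
```

Collect these per request and report percentiles over them, rather than computing them from averaged timings.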

Common Mistake

Using only average request latency. It hides whether users wait at prefill or during generation, and it misses tail spikes that show up in P95/P99.

Source / Inspiration: LLM quiz inspiration · vLLM docs

07. How would you autoscale an LLM service?

Short Answer

Kubernetes autoscaling can use resource metrics or custom/external metrics. For LLM serving, teams often combine capacity signals such as queue length, request rate, GPU utilization, token throughput and SLO/tail latency; the right signal depends on workload shape, batching policy and observability setup.

Deeper Explanation

LLM requests vary widely in prompt and output length. Two requests can have very different token work. Autoscaling also needs warmup time, model load time, metric availability and capacity buffers because adding a GPU replica is not instant.
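
The core of a metric-driven scaling decision can be sketched with the ratio rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the current replica count scaled by the ratio of the observed metric to its target, rounded up. The sketch below ignores the autoscaler's tolerance and stabilization behavior, and the choice of queue depth per replica as the metric is an assumption for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=8):
    """Ratio-based scaling decision, clamped to the allowed replica range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Assumed signal: average queued requests per replica, with a target of 10.
print(desired_replicas(current_replicas=2, current_metric=30, target_metric=10))  # 6
```

Because a new GPU replica may take minutes to pull the image, load weights and warm up, the target is usually set below the saturation point so that the buffer covers the startup delay.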

Common Mistake

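Scaling only on HTTP QPS. Token workload and KV cache pressure are usually better indicators of load, because two requests with the same QPS cost can differ widely in token work.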

Source / Inspiration: LLM quiz inspiration · Kubernetes docs

08. How does quantization help serving, and what is the risk?

Short Answer

Quantization stores or computes with lower-precision representations, reducing memory footprint and often improving throughput. The risks are quality loss, unsupported kernels and different latency behavior on the target hardware.

Path | What shrinks | Benefit | Risk to validate
Weight-only | Model weights. | Residency and decode bandwidth. | Dequantization overhead and output quality.
Weight-activation | Weights and activation compute path. | Lower-precision kernels for suitable workloads. | Kernel coverage and activation outliers.
KV cache precision | Serving cache tensors. | More concurrent or longer contexts. | Quality drift and cache backend support.

Deeper Explanation

A strong answer separates weight-only quantization, activation quantization and KV cache precision. It should also mention validation on task-specific data rather than assuming a generic compression method is safe.
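
To show where the quality risk comes from, here is a minimal sketch of symmetric per-tensor int8 weight quantization in plain Python. It is one simple scheme chosen for illustration; production methods typically use per-channel or per-group scales and calibrated kernels, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    # Fall back to scale 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.013, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max abs error={worst_error:.4f}")
```

The rounding error per weight is bounded by half the scale, so a single large outlier widens the scale and degrades every other weight, which is why validation on task-specific data matters.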

Common Mistake

Saying quantization is only a storage optimization. It changes runtime memory and sometimes compute path.

Source / Inspiration: LLM quiz inspiration · Hugging Face Transformers docs

09. When would you use a Kubernetes Deployment for serving?

Short Answer

Use a Deployment when you need long-running replicas, rolling updates and service availability. Pair it with a Service, health checks, resource requests and rollout strategy.

Deeper Explanation

For GPU serving, placement, image size, model download, readiness gates and graceful shutdown matter. A pod should not receive traffic before the model is loaded and warm. Rollout strategy should respect capacity limits.
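
The readiness rule can be sketched as a small state object that a health endpoint would consult. The class and method names are hypothetical; the only Kubernetes behavior relied on is that an HTTP readiness probe treats a 2xx or 3xx status as ready and anything else as not ready.

```python
import threading


class ModelServerState:
    """Tracks whether the model is loaded and warmed up."""

    def __init__(self):
        self._ready = threading.Event()

    def load_and_warm(self, load_fn, warmup_fn):
        model = load_fn()    # e.g. download and load weights and tokenizer
        warmup_fn(model)     # e.g. run a few dummy requests
        self._ready.set()    # flip to ready only after both steps succeed
        return model

    def readiness_status(self):
        """HTTP status a readiness endpoint would return."""
        return 200 if self._ready.is_set() else 503


state = ModelServerState()
print(state.readiness_status())  # 503
state.load_and_warm(load_fn=lambda: "model", warmup_fn=lambda model: None)
print(state.readiness_status())  # 200
```

If loading or warmup raises, the event is never set and the pod stays out of the Service endpoints instead of serving errors.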

Common Mistake

Marking the container ready before model weights and tokenizer are actually usable.

Source / Inspiration: Kubernetes quiz inspiration · Kubernetes docs

10. What is RAG, and why does it change serving architecture?

Short Answer

RAG retrieves relevant external context and passes it to the model at generation time. It adds embedding, indexing, vector search, reranking and freshness concerns to the serving path.

Deeper Explanation

The model is no longer the only dependency. Retrieval latency, document chunking, vector database recall, prompt construction and source attribution can affect answer quality. Monitoring must cover both generation and retrieval.

Common Mistake

Thinking RAG is just a bigger prompt. The retrieval system is a production subsystem with its own failure modes.

Source / Inspiration: LLM quiz inspiration · vLLM docs

11. What does a vector database store for RAG?

Short Answer

It stores embeddings and metadata so semantically similar items can be retrieved by nearest-neighbor search. Metadata filters and document identifiers are usually as important as vectors.

Deeper Explanation

A vector database is useful when retrieval needs approximate similarity at scale. The system still needs chunking, embedding version control, deletion/update semantics and evaluation. Bad chunks or stale embeddings can degrade answer quality.
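
The role of metadata next to vectors can be sketched with an exact brute-force search. Real vector databases use approximate nearest-neighbor indexes instead of a full scan; the record layout, the `embed_version` field and the function names are assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, top_k=2, where=None):
    """Filter by metadata first, then rank the survivors by cosine similarity."""
    where = where or {}
    candidates = [
        item for item in index
        if all(item["meta"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda item: cosine(item["vec"], query_vec),
                    reverse=True)
    return [item["id"] for item in ranked[:top_k]]


index = [
    {"id": "doc1#0", "vec": [1.0, 0.0], "meta": {"embed_version": "v2"}},
    {"id": "doc2#3", "vec": [0.9, 0.1], "meta": {"embed_version": "v1"}},
    {"id": "doc3#1", "vec": [0.0, 1.0], "meta": {"embed_version": "v2"}},
]
print(search(index, [1.0, 0.0], top_k=2, where={"embed_version": "v2"}))
# ['doc1#0', 'doc3#1']
```

The filter keeps chunks embedded with an older model out of the results, which is one way stale embeddings are kept from mixing with current ones.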

Common Mistake

Assuming vector search guarantees correct answers. Retrieval quality must be measured.

Source / Inspiration: LLM quiz inspiration · External curriculum

12. What should you monitor for an inference API?

Short Answer

Monitor latency percentiles such as P95/P99, error rate, saturation, queue depth, model version, input size, output size, GPU memory, token throughput and business/model quality signals. For RAG, add retrieval latency and hit quality.

Deeper Explanation

The dashboard should connect symptoms to action. Latency without queue depth hides overload; GPU memory without request length hides KV pressure; error rate without model version hides rollout regressions.
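
A short example shows why percentiles belong on the dashboard next to the mean. The sketch uses the nearest-rank percentile definition, and the latency samples are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Assumed sample: 95 fast requests and 5 slow ones, in milliseconds.
latencies_ms = [100] * 95 + [2000] * 5
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 99))
# 195.0 100 2000
```

The mean suggests a mildly slow service, while P99 shows that some users wait twenty times longer than the median.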

Common Mistake

Only monitoring infrastructure metrics. Model and request-shape metrics explain many incidents.

Source / Inspiration: Observability quiz inspiration · Prometheus docs

13. How would you troubleshoot high serving latency?

Short Answer

Break latency into queueing, preprocessing, model compute, retrieval, postprocessing and network time. Then inspect request size, batch settings, GPU utilization, cache pressure and recent rollouts.

Deeper Explanation

For LLMs, separate prefill and decode. Long prompts can hurt TTFT, while overloaded decode can hurt token cadence. For RAG, slow retrieval or reranking may dominate even if the model is healthy.
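
A phase breakdown can be collected with a small timing helper. This is a sketch with hypothetical names; the phase labels and the sleeps standing in for real work are assumptions.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a request."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        return max(self.durations, key=self.durations.get)


timer = PhaseTimer()
with timer.phase("retrieval"):
    time.sleep(0.01)   # stand-in for vector search and reranking
with timer.phase("prefill"):
    time.sleep(0.03)   # stand-in for prompt processing
print(timer.slowest(), timer.durations)
```

Emitting these per-phase durations as metrics turns "latency is high" into "retrieval is high" or "prefill is high", which points at a specific fix.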

Common Mistake

Treating latency as one number. Without phase breakdown, fixes are guesswork.

Source / Inspiration: LLM quiz inspiration · Observability quiz inspiration

14. Why use canary or shadow deployment for models?

Short Answer

Canary exposes a small user slice to a new model; shadow runs the new model on mirrored traffic without affecting users. Both reduce deployment risk.

Rollout method | Who sees output | Validates | Constraint
Shadow | No user receives new-model output. | Runtime behavior, logs, latency, and capacity. | Cannot prove user outcome impact.
Canary | A small live user slice. | Real user quality and service impact with rollback. | Some users are exposed to regressions.

Deeper Explanation

Models can regress in ways offline tests miss, including latency, calibration, bias, prompt sensitivity or distribution mismatch. Canary validates real user impact with rollback; shadow validates runtime behavior and logs before exposure.
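
One common way to pick the canary slice is deterministic hashing of a stable identifier, so the same user always lands on the same model version. The sketch below is an assumption about how a router might do this, not a description of any specific gateway; the function name and labels are hypothetical.

```python
import hashlib


def route(user_id, canary_percent):
    """Return 'canary' for a stable slice of users, otherwise 'stable'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


assignments = [route(f"user-{i}", canary_percent=5) for i in range(1000)]
print(assignments.count("canary"), "of 1000 users routed to the canary")
```

Rollback is then a configuration change: setting the percentage to zero sends every user back to the stable model.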

Common Mistake

Treating a successful container deploy as a successful model deploy. Model behavior still needs validation.

Source / Inspiration: MLOps quiz inspiration · Observability quiz inspiration

15. How do rate limiting and admission control protect LLM serving?

Short Answer

They prevent overload by controlling how many requests, tokens or active sequences enter the system. Good admission control uses memory and token budgets, not only request count.

Deeper Explanation

LLM servers can fail from long prompts or many concurrent decodes. Rejecting or queueing early can preserve SLOs for accepted requests. The policy should be visible to clients and tied to capacity planning.
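
A token-budget admission check can be sketched in a few lines. The class is hypothetical, and charging each request its prompt length plus its maximum output length is a conservative assumption; a real server might refine the estimate as generation progresses.

```python
class TokenBudgetAdmission:
    """Admit requests only while the worst-case token total fits the budget."""

    def __init__(self, max_tokens_in_flight):
        self.max_tokens = max_tokens_in_flight
        self.in_flight = {}  # request_id -> reserved tokens

    def try_admit(self, request_id, prompt_tokens, max_new_tokens):
        cost = prompt_tokens + max_new_tokens  # worst-case KV footprint in tokens
        if sum(self.in_flight.values()) + cost > self.max_tokens:
            return False  # caller should queue or reject, e.g. with HTTP 429
        self.in_flight[request_id] = cost
        return True

    def finish(self, request_id):
        self.in_flight.pop(request_id, None)


gate = TokenBudgetAdmission(max_tokens_in_flight=1000)
print(gate.try_admit("r1", prompt_tokens=600, max_new_tokens=200))  # True
print(gate.try_admit("r2", prompt_tokens=300, max_new_tokens=100))  # False
gate.finish("r1")
print(gate.try_admit("r2", prompt_tokens=300, max_new_tokens=100))  # True
```

The budget itself should come from capacity planning, for example from the KV cache memory left after the weights are loaded.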

Common Mistake

Using only per-user QPS limits. Token count and sequence length are the actual workload drivers.

Source / Inspiration: LLM quiz inspiration · vLLM docs

16. What is the difference between model parallel serving and replicated serving?

Short Answer

Replicated serving puts a full model copy behind each replica; model parallel serving splits one model across multiple GPUs. Replication improves horizontal throughput when the model fits; model parallelism is needed when it does not.

Serving layout | Model placement | Good fit | Cost
Replicated serving | Each replica owns a full model copy. | Model fits per replica and throughput scaling matters. | Duplicates model memory across replicas.
Model-parallel serving | One replica spans multiple GPUs. | Model cannot fit or run efficiently on one GPU. | Cross-GPU communication and topology dependence.

Deeper Explanation

The tradeoff is communication. Replicas are simpler to scale and isolate, while model-parallel inference has cross-GPU dependencies and topology constraints. Large LLMs often combine tensor parallelism with replica groups.
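
The sizing decision can be sketched as a small planning function. All inputs are assumptions for illustration: memory is counted in GiB, a fixed fraction of each GPU is reserved for KV cache and activations, and the parallel group size is restricted to a short list of allowed sizes because tensor-parallel degrees usually have to divide the model's attention-head count.

```python
import math


def plan_layout(model_gib, gpu_gib, total_gpus,
                kv_headroom_fraction=0.3, valid_sizes=(1, 2, 4, 8)):
    """Return (gpus_per_replica, replicas), or None if the model cannot fit."""
    usable_gib = gpu_gib * (1.0 - kv_headroom_fraction)
    needed = math.ceil(model_gib / usable_gib)
    # Smallest allowed group size that holds the weights.
    gpus_per_replica = next((s for s in valid_sizes if s >= needed), None)
    if gpus_per_replica is None or gpus_per_replica > total_gpus:
        return None
    return gpus_per_replica, total_gpus // gpus_per_replica


print(plan_layout(model_gib=14, gpu_gib=80, total_gpus=8))   # (1, 8)
print(plan_layout(model_gib=140, gpu_gib=80, total_gpus=8))  # (4, 2)
```

The same eight GPUs yield eight independent replicas for the small model but only two replica groups for the large one.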

Common Mistake

Assuming adding GPUs always means more independent replicas. The model may require multiple GPUs per replica.

Source / Inspiration: LLM quiz inspiration · NCCL docs

17. When would you separate prefill and decode or other generation stages?

Short Answer

Separate stages when their resource needs, batching policy, or latency objective differ enough to justify explicit transfer and scheduling overhead. For multimodal systems this can also isolate autoregressive, diffusion, and decode stages.

Deeper Explanation

Disaggregation allows independent placement and scaling but makes connector transfer, bounded queues, cancellation and stage-level metrics part of the serving contract.
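
The bounded-queue and cancellation parts of that contract can be sketched with the standard library. This is an illustration of the hand-off policy only, not of how any engine transfers KV cache between stages; the function name, item layout and return labels are hypothetical.

```python
import queue


def hand_off(stage_queue, item, cancelled):
    """Try to pass a prefilled request to the decode stage."""
    if item["request_id"] in cancelled:
        return "dropped"       # do not spend decode capacity on cancelled work
    try:
        stage_queue.put_nowait(item)
        return "queued"
    except queue.Full:
        return "rejected"      # back-pressure: upstream must slow down or shed load


decode_queue = queue.Queue(maxsize=1)
cancelled = {"r3"}
print(hand_off(decode_queue, {"request_id": "r1"}, cancelled))  # queued
print(hand_off(decode_queue, {"request_id": "r2"}, cancelled))  # rejected
print(hand_off(decode_queue, {"request_id": "r3"}, cancelled))  # dropped
```

Counting each outcome per stage gives the stage-level metrics needed to tell whether prefill or decode is the bottleneck.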

Common Mistake

Claiming disaggregation automatically improves throughput. It is a design choice whose benefit must be measured for the target workload.

Source: vLLM disaggregated prefilling · vLLM-Omni stages

Review

Final Review Checklist

Before an interview, you should be able to answer these without reading the page.

  • What does a model serving system need beyond a prediction function?
  • Why does batching improve inference throughput?
  • What is continuous batching in LLM serving?
  • What is KV cache, and why does it matter for serving?
  • What does PagedAttention change?
  • How do TTFT and TPOT differ?
  • How would you autoscale an LLM service?
  • How does quantization help serving, and what is the risk?

Sources

Sources and Further Reading

Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.
