Foundation
Practice the system explanation inside the page.
This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.
Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.
Grouped by the kind of explanation the interview usually asks for.
Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.
A serving system needs request handling, batching, resource isolation, versioning, rollout control, monitoring, autoscaling and failure handling. The model function is only the core computation.
Serving turns a model artifact into a production API. The system must handle traffic variability, input validation, latency targets, model versions and observability. For LLMs, scheduling and KV cache management become central.
Equating serving with loading a checkpoint in a web server.
Source / Inspiration: LLM module quiz inspiration · MLOps quiz inspiration
Batching lets the accelerator process multiple requests together, improving hardware utilization. The tradeoff is that waiting to form a batch can increase latency.
For fixed-shape models, batching is relatively straightforward. For LLMs, requests have variable prompt and output lengths, so continuous batching can add or remove sequences between decode steps. The scheduler must balance throughput and tail latency.
Thinking bigger batches always improve user experience. Large batches can hurt latency and memory.
Source / Inspiration: LLM quiz inspiration · vLLM docs
Continuous batching schedules active generation requests together at token steps and admits new requests as others finish. It avoids waiting for an entire static batch to complete.
LLM generation has a long decode phase where each request may finish at a different time. Continuous batching keeps the GPU busier, but the scheduler must manage KV cache, fairness, maximum sequence length and latency targets.
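The dynamic-membership idea can be sketched with a toy simulation. This is a minimal illustration, not how any particular serving engine schedules work: `simulate_continuous_batching`, `output_lengths` and `max_batch` are hypothetical names, and the sketch ignores prefill cost, KV memory and fairness.

```python
from collections import deque

def simulate_continuous_batching(output_lengths, max_batch):
    """Return the decode step at which each request finishes.

    output_lengths: tokens each request must generate (hypothetical workload).
    max_batch: maximum number of sequences decoded together per step.
    """
    waiting = deque(enumerate(output_lengths))
    active = {}        # request id -> tokens still to generate
    finished_at = {}   # request id -> decode step of completion
    step = 0
    while waiting or active:
        # Admit waiting requests into free slots before every decode step.
        while waiting and len(active) < max_batch:
            req_id, length = waiting.popleft()
            active[req_id] = length
        step += 1
        # One decode step produces one token for every active sequence.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                del active[req_id]
                finished_at[req_id] = step
    return finished_at
```

With lengths `[2, 5, 1]` and `max_batch=2`, the third request takes over the slot freed by the first one at step 3 instead of waiting for the five-token request to finish, which is the behavior a static batch cannot provide.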
Describing it as ordinary request batching. The key distinction is dynamic membership across decode steps.
Source / Inspiration: vLLM docs · LLM quiz inspiration
KV cache stores attention keys and values from previous tokens so decoding does not recompute the whole prefix every step. It speeds generation but grows linearly with batch size, sequence length, layer count, number of KV heads and head dimension.
Serving systems must budget KV memory alongside model weights. Long prompts and many concurrent users can exhaust memory even when the model itself fits. Cache eviction, paging and request admission are therefore infrastructure concerns.
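A back-of-envelope sizing helper makes the budget concrete. This is a sketch under stated assumptions: the function name and the example shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are illustrative, and real engines add allocator and block overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size for a decoder-only transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache tensors.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative shape: one 4096-token sequence.
one_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(one_sequence / 1024**3, "GiB")
```

For this shape a single 4096-token sequence needs 2 GiB of cache, so a few dozen concurrent long requests can outweigh the weights themselves.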
Only counting model weights when sizing an LLM server.
Source / Inspiration: Hugging Face KV cache docs · LLM quiz inspiration
PagedAttention manages KV cache in blocks, similar to virtual memory pages. This reduces waste from variable-length sequences and helps serve more concurrent requests.
The important idea is memory management, not a new model architecture. By decoupling logical sequence cache from contiguous physical allocation, serving systems can reduce fragmentation and improve batching flexibility.
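The block-table idea can be sketched as a toy allocator. This is a simplified illustration of paging in general, not vLLM's implementation; `BlockAllocator` and its methods are hypothetical names, and the sketch tracks only block ownership, not tensor storage.

```python
class BlockAllocator:
    """Toy paged KV allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new physical block is needed only when the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # Finished sequences return their blocks for reuse by others.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks need not be contiguous, at most one partially filled block per sequence is wasted, rather than a whole reservation sized for the maximum sequence length.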
Calling PagedAttention a different attention formula. It is a KV cache management technique for serving.
Source / Inspiration: PagedAttention paper · vLLM docs
TTFT measures how long the user waits for the first generated token. TPOT measures the pace of subsequent tokens after generation starts. Track them at P95/P99, not only average latency.
| Metric | Measures | What it reveals |
|---|---|---|
| TTFT | Time to first token. | Queueing and prefill latency. |
| TPOT | Time per output token after the first. | Decode and cache-read behavior. |
| QPS | Requests completed per second. | Admission capacity for the workload mix. |
| tokens/s | Generated or processed tokens per second. | Throughput under stated lengths and batching. |
| P95/P99 | Tail latency percentiles. | Queue spikes and unfair scheduling. |
TTFT is dominated by queueing, scheduling and prompt prefill. TPOT is dominated by decode efficiency, batching and cache behavior. Optimizing one can hurt the other, so LLM serving dashboards need both.
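Both metrics can be derived from per-token timestamps. The sketch below assumes the server records the request arrival time and the emission time of each output token; the function names are hypothetical, and the percentile uses the nearest-rank definition, which is one of several in common use.

```python
import math

def ttft_and_tpot(arrival_time, token_times):
    """token_times: timestamps (seconds) at which each output token was emitted."""
    ttft = token_times[0] - arrival_time
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined for a single-token response
    # Average gap between consecutive tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `ttft` and `tpot` per request and then calling `percentile(ttfts, 99)` gives the tail view that an average hides.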
Using only average request latency. It hides whether users wait at prefill or during generation, and it misses tail spikes that show up in P95/P99.
Source / Inspiration: LLM quiz inspiration · vLLM docs
Kubernetes autoscaling can use resource metrics or custom/external metrics. For LLM serving, teams often combine capacity signals such as queue length, request rate, GPU utilization, token throughput and SLO/tail latency; the right signal depends on workload shape, batching policy and observability setup.
LLM requests vary widely in prompt and output length. Two requests can have very different token work. Autoscaling also needs warmup time, model load time, metric availability and capacity buffers because adding a GPU replica is not instant.
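A token-based capacity estimate can be sketched as follows. This is a planning heuristic, not the Kubernetes autoscaler's algorithm; `desired_replicas`, the headroom fraction and the replica bounds are assumptions that would need tuning against measured throughput.

```python
import math

def desired_replicas(tokens_per_second_demand, tokens_per_second_per_replica,
                     headroom=0.2, min_replicas=1, max_replicas=16):
    """Estimate replica count from token demand rather than request count.

    headroom reserves capacity for bursts and for the delay while new
    replicas download, load and warm the model.
    """
    usable = tokens_per_second_per_replica * (1 - headroom)
    needed = math.ceil(tokens_per_second_demand / usable)
    return max(min_replicas, min(max_replicas, needed))
```

For example, 9000 tokens/s of demand against replicas measured at 2500 tokens/s yields 5 replicas with the default 20% headroom, whereas dividing raw numbers would suggest 4 and leave no buffer.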
Scaling only on HTTP QPS. Token workload and KV cache pressure matter more.
Source / Inspiration: LLM quiz inspiration · Kubernetes docs
Quantization stores or computes with lower-precision representations, reducing memory footprint and often improving throughput. The risk is quality loss, unsupported kernels or different latency behavior on target hardware.
| Path | What shrinks | Benefit | Risk to validate |
|---|---|---|---|
| Weight-only | Model weights. | Residency and decode bandwidth. | Dequantization overhead and output quality. |
| Weight-activation | Weights and activation compute path. | Lower-precision kernels for suitable workloads. | Kernel coverage and activation outliers. |
| KV cache precision | Serving cache tensors. | More concurrent or longer contexts. | Quality drift and cache backend support. |
A strong answer separates weight-only quantization, activation quantization and KV cache precision. It should also mention validation on task-specific data rather than assuming a generic compression method is safe.
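The weight-only path can be illustrated with symmetric per-tensor int8 quantization. This is a teaching sketch in pure Python with hypothetical function names; production methods typically use per-channel or per-group scales and fused kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization of a list of floats."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights at compute time."""
    return [q * scale for q in quantized]

weights = [0.4, -1.0, 0.26, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half a quantization step, but a single outlier inflates `scale` for the whole tensor, which is one reason output quality must be checked on task-specific data.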
Saying quantization is only a storage optimization. It changes runtime memory and sometimes compute path.
Source / Inspiration: LLM quiz inspiration · Hugging Face Transformers docs
Use a Deployment when you need long-running replicas, rolling updates and service availability. Pair it with a Service, health checks, resource requests and rollout strategy.
For GPU serving, placement, image size, model download, readiness gates and graceful shutdown matter. A pod should not receive traffic before the model is loaded and warm. Rollout strategy should respect capacity limits.
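The readiness rule can be sketched as a small state holder that a probe endpoint would consult. `ModelReplica`, `load_fn` and `warmup_fn` are hypothetical names; the only external fact relied on is that a Kubernetes HTTP probe treats status codes from 200 to 399 as success.

```python
class ModelReplica:
    """Reports ready only after the model is loaded and warmed."""

    def __init__(self, load_fn, warmup_fn):
        self._load_fn = load_fn
        self._warmup_fn = warmup_fn
        self._ready = False
        self.model = None

    def start(self):
        self.model = self._load_fn()      # weights and tokenizer
        self._warmup_fn(self.model)       # e.g. one dummy request
        self._ready = True                # flip only after both succeed

    def begin_shutdown(self):
        # Fail readiness first so traffic drains before the process exits.
        self._ready = False

    def readiness_status(self):
        """HTTP status a readiness handler would return."""
        return 200 if self._ready else 503
```

If `load_fn` or `warmup_fn` raises, the flag never flips, so the pod stays out of the Service endpoints instead of serving errors.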
Marking the container ready before model weights and tokenizer are actually usable.
Source / Inspiration: Kubernetes quiz inspiration · Kubernetes docs
RAG retrieves relevant external context and passes it to the model at generation time. It adds embedding, indexing, vector search, reranking and freshness concerns to the serving path.
The model is no longer the only dependency. Retrieval latency, document chunking, vector database recall, prompt construction and source attribution can affect answer quality. Monitoring must cover both generation and retrieval.
Thinking RAG is just a bigger prompt. The retrieval system is a production subsystem with its own failure modes.
Source / Inspiration: LLM quiz inspiration · vLLM docs
A vector database stores embeddings and metadata so semantically similar items can be retrieved by nearest-neighbor search. Metadata filters and document identifiers are usually as important as vectors.
A vector database is useful when retrieval needs approximate similarity at scale. The system still needs chunking, embedding version control, deletion/update semantics and evaluation. Bad chunks or stale embeddings can degrade answer quality.
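The retrieval step can be sketched with exact cosine search plus a metadata filter. This is a brute-force illustration, not an approximate index; the record fields (`doc_id`, `vector`, `embedding_version`) and function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vec, top_k=2, embedding_version=None):
    """Return doc ids of the top_k most similar items.

    Filtering on embedding_version avoids comparing a query against
    vectors produced by a different embedding model.
    """
    candidates = [item for item in index
                  if embedding_version is None
                  or item["embedding_version"] == embedding_version]
    ranked = sorted(candidates,
                    key=lambda item: cosine(item["vector"], query_vec),
                    reverse=True)
    return [item["doc_id"] for item in ranked[:top_k]]
```

The filter changes which documents can be returned at all, which is why metadata hygiene and embedding version control affect answer quality as much as the similarity metric does.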
Assuming vector search guarantees correct answers. Retrieval quality must be measured.
Source / Inspiration: LLM quiz inspiration · External curriculum
Monitor latency percentiles such as P95/P99, error rate, saturation, queue depth, model version, input size, output size, GPU memory, token throughput and business/model quality signals. For RAG, add retrieval latency and hit quality.
The dashboard should connect symptoms to action. Latency without queue depth hides overload; GPU memory without request length hides KV pressure; error rate without model version hides rollout regressions.
Only monitoring infrastructure metrics. Model and request-shape metrics explain many incidents.
Source / Inspiration: Observability quiz inspiration · Prometheus docs
Break latency into queueing, preprocessing, model compute, retrieval, postprocessing and network time. Then inspect request size, batch settings, GPU utilization, cache pressure and recent rollouts.
For LLMs, separate prefill and decode. Long prompts can hurt TTFT, while overloaded decode can hurt token cadence. For RAG, slow retrieval or reranking may dominate even if the model is healthy.
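A phase breakdown can be as simple as per-phase timings on each request. The sketch below assumes the service already records milliseconds per phase; the phase names and the `dominant_phase` helper are hypothetical.

```python
def dominant_phase(phase_ms):
    """Return the phase with the largest share of latency and all shares."""
    total = sum(phase_ms.values())
    if total == 0:
        return None, {}
    shares = {name: ms / total for name, ms in phase_ms.items()}
    worst = max(shares, key=shares.get)
    return worst, shares

# Hypothetical timings for one slow request.
worst, shares = dominant_phase(
    {"queue": 400, "prefill": 150, "decode": 300, "retrieval": 150}
)
```

Here queueing accounts for the largest share, which points toward admission or capacity rather than model compute.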
Treating latency as one number. Without phase breakdown, fixes are guesswork.
Source / Inspiration: LLM quiz inspiration · Observability quiz inspiration
Canary exposes a small user slice to a new model; shadow runs the new model on mirrored traffic without affecting users. Both reduce deployment risk.
| Rollout method | Who sees output | Validates | Constraint |
|---|---|---|---|
| Shadow | No user receives new-model output. | Runtime behavior, logs, latency, and capacity. | Cannot prove user outcome impact. |
| Canary | A small live user slice. | Real user quality and service impact with rollback. | Some users are exposed to regressions. |
Models can regress in ways offline tests miss, including latency, calibration, bias, prompt sensitivity or distribution mismatch. Canary validates real user impact with rollback; shadow validates runtime behavior and logs before exposure.
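Canary assignment is often done with a stable hash so a user stays on the same variant across requests. This is a minimal sketch with hypothetical names; real routers usually sit in a gateway or service mesh and add rollback controls.

```python
import hashlib

def route(user_id, canary_percent):
    """Deterministically map a user to 'canary' or 'stable'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Setting `canary_percent` to 0 is the rollback path: every user returns to the stable model without a redeploy. Shadow traffic differs in that the request is duplicated to the new model and its response is logged rather than returned.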
Treating a successful container deploy as a successful model deploy. Model behavior still needs validation.
Source / Inspiration: MLOps quiz inspiration · Observability quiz inspiration
Rate limits and admission control prevent overload by controlling how many requests, tokens or active sequences enter the system. Good admission control uses memory and token budgets, not only request count.
LLM servers can fail from long prompts or many concurrent decodes. Rejecting or queueing early can preserve SLOs for accepted requests. The policy should be visible to clients and tied to capacity planning.
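A token-budget admission check can be sketched as below. `TokenBudgetAdmission` and its parameters are hypothetical, and charging the full `max_new_tokens` up front is a deliberately conservative assumption; real schedulers may admit optimistically and preempt.

```python
class TokenBudgetAdmission:
    """Admit requests while the worst-case token footprint fits the budget."""

    def __init__(self, max_tokens_in_flight):
        self.max_tokens_in_flight = max_tokens_in_flight
        self.tokens_in_flight = 0

    def try_admit(self, prompt_tokens, max_new_tokens):
        # Worst case: the request uses its whole output allowance.
        cost = prompt_tokens + max_new_tokens
        if self.tokens_in_flight + cost > self.max_tokens_in_flight:
            return False   # caller should queue or return a retryable error
        self.tokens_in_flight += cost
        return True

    def release(self, prompt_tokens, max_new_tokens):
        # Call when the request finishes or is cancelled.
        self.tokens_in_flight -= prompt_tokens + max_new_tokens
```

Two requests count the same under a QPS limit but very differently here: a 7000-token prompt can consume most of the budget that dozens of short requests would share.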
Using only per-user QPS limits. Token count and sequence length are the actual workload drivers.
Source / Inspiration: LLM quiz inspiration · vLLM docs
Replicated serving puts a full model copy behind each replica; model parallel serving splits one model across multiple GPUs. Replication improves horizontal throughput when the model fits; model parallelism is needed when it does not.
| Serving layout | Model placement | Good fit | Cost |
|---|---|---|---|
| Replicated serving | Each replica owns a full model copy. | Model fits per replica and throughput scaling matters. | Duplicates model memory across replicas. |
| Model-parallel serving | One replica spans multiple GPUs. | Model cannot fit or run efficiently on one GPU. | Cross-GPU communication and topology dependence. |
The tradeoff is communication. Replicas are simpler to scale and isolate, while model-parallel inference has cross-GPU dependencies and topology constraints. Large LLMs often combine tensor parallelism with replica groups.
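The combination of model parallelism and replica groups reduces to simple arithmetic. The sketch below is a rough planning aid with hypothetical names and an assumed KV reserve fraction; real frameworks usually also require the parallel size to divide the attention head count and to match interconnect topology.

```python
import math

def plan_layout(total_gpus, model_gib, gpu_gib, kv_reserve_fraction=0.3):
    """Estimate GPUs per replica, replica count and idle GPUs.

    kv_reserve_fraction is the share of each GPU kept free for
    KV cache and activations (an assumption, not a measured value).
    """
    usable_per_gpu = gpu_gib * (1 - kv_reserve_fraction)
    gpus_per_replica = max(1, math.ceil(model_gib / usable_per_gpu))
    replicas, idle = divmod(total_gpus, gpus_per_replica)
    return gpus_per_replica, replicas, idle
```

For example, `plan_layout(8, model_gib=140, gpu_gib=80, kv_reserve_fraction=0.5)` gives 4 GPUs per replica and 2 replicas, so eight GPUs buy two independent serving groups rather than eight.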
Assuming adding GPUs always means more independent replicas. The model may require multiple GPUs per replica.
Source / Inspiration: LLM quiz inspiration · NCCL docs
Separate stages when their resource needs, batching policy, or latency objective differ enough to justify explicit transfer and scheduling overhead. For multimodal systems this can also isolate autoregressive, diffusion, and decode stages.
Disaggregation allows independent placement and scaling but makes connector transfer, bounded queues, cancellation and stage-level metrics part of the serving contract.
Claiming disaggregation automatically improves throughput. It is a design choice whose benefit must be measured for the target workload.
Before an interview, you should be able to answer these without reading the page.
Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.