Foundation
4 representative questions.
Practice the system explanation inside the page.
This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.
Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.
Grouped by the kind of explanation the interview usually asks for.
Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.
AI infrastructure turns model code into a repeatable production system. It optimizes throughput, latency, reliability, cost, reproducibility and operational visibility, not only loss curves.
A strong answer separates model quality from the system that trains and serves it. The system has to schedule GPUs, move data, keep artifacts reproducible, expose metrics and recover from failures. Accuracy matters, but production teams also need predictable behavior under load and clear ownership when something breaks.
Only talking about frameworks such as PyTorch. Frameworks are part of the stack, but infra includes orchestration, storage, networking, deployment and monitoring.
Source / Inspiration: External curriculum coverage · AI infra curriculum README
Start with a timeline: data loading, host-to-device transfer, GPU kernels, backward pass and distributed communication. Then compare GPU utilization, step time variance, dataloader wait time, memory pressure and collective duration.
The goal is to localize the bottleneck before changing code. If the GPU is idle, look at input pipeline and CPU preprocessing. If kernels dominate, profile compute and memory access. If collectives dominate, inspect bucket sizes, topology and overlap. In multi-node training, network and storage are often as important as model code.
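The first split described above (input pipeline versus compute) can be sketched with nothing but wall-clock timers. This is a minimal illustration, not a replacement for a real profiler such as Nsight Systems; `profile_steps` and its arguments are hypothetical names, and for GPU work the step function is assumed to synchronize the device before returning so the timing is honest.

```python
import time
from typing import Callable, Dict, Iterable


def profile_steps(
    batches: Iterable,
    train_step: Callable[[object], None],
    max_steps: int = 50,
) -> Dict[str, float]:
    """Split wall time into time waiting for data and time inside the step."""
    data_wait = 0.0
    step_time = 0.0
    steps = 0
    iterator = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        # Assumption: train_step blocks until the work is done. GPU kernels are
        # launched asynchronously, so a real step would synchronize first.
        train_step(batch)
        t2 = time.perf_counter()
        data_wait += t1 - t0
        step_time += t2 - t1
        steps += 1
    total = data_wait + step_time
    return {
        "steps": steps,
        "data_wait_s": data_wait,
        "step_s": step_time,
        "data_wait_fraction": data_wait / total if total > 0 else 0.0,
    }
```

A high `data_wait_fraction` points at data loading or CPU preprocessing; a low one says the time is inside the step, which is where kernel-level and collective-level profiling becomes worthwhile.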
Jumping straight to a larger GPU. More hardware can hide the problem temporarily while increasing cost.
Source / Inspiration: GPU module quiz inspiration · NVIDIA Nsight Systems
ML infrastructure is the platform substrate: compute, storage, networking, serving and observability. MLOps is the operating discipline that versions data/models, automates training and deployment, tracks experiments and monitors model behavior.
They overlap in practice. A platform may provide registries, CI/CD, feature stores and monitoring, while MLOps defines how teams use those tools to ship safely. In interviews, frame infrastructure as capabilities and MLOps as lifecycle control.
Treating MLOps as only CI/CD. ML systems also need data lineage, model validation, drift monitoring and rollback paths.
Source / Inspiration: MLOps module quiz inspiration · External curriculum
Containers make dependencies, CUDA libraries, application code and runtime entrypoints reproducible across machines. They also give schedulers a standard unit to place, restart and resource-limit.
For GPU workloads, containerization is not enough by itself. The host driver, container runtime and GPU device plugin still have to expose devices correctly. A good answer mentions reproducibility and the boundary between image contents and host-level GPU support.
Assuming a container freezes the GPU driver. The host driver remains part of the compatibility story.
Source / Inspiration: Kubernetes module quiz inspiration · NVIDIA Kubernetes device plugin
Kubernetes helps when teams need scheduling, service discovery, rollouts, autoscaling and standardized deployment across many workloads. It adds friction when GPU topology, storage locality, long-running training jobs or debugging needs are poorly modeled.
Kubernetes is a control plane, not a magic performance layer. It can manage Deployments, Jobs and resource requests, but ML workloads still need correct images, device plugins, checkpointing and careful network/storage design.
Saying Kubernetes automatically makes ML scalable. It provides orchestration primitives; the workload must still be designed for them.
Source / Inspiration: Kubernetes module quiz inspiration · Kubernetes docs
Requests drive scheduling decisions; limits cap what a container can use at runtime. For GPUs, a pod asks for whole or partitioned devices exposed by the device plugin. CPU and memory limits still apply: exceeding a CPU limit throttles the process, and exceeding a memory limit gets it killed.
This matters because a model may fail from host memory pressure even when GPU memory looks fine. Requests should reflect what the scheduler must reserve, and limits should avoid accidental noisy-neighbor behavior. For GPUs, placement and topology often matter more than just count.
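As a rough sketch of the point about reserving more than the GPU, the snippet below builds a container `resources` stanza as a plain Python dict and flags resources that were never declared. `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin; the helper names, the choice to set no CPU limit, and the example sizes are assumptions for illustration only.

```python
from typing import Dict, List, Sequence

GPU_RESOURCE = "nvidia.com/gpu"  # extended resource exposed by the NVIDIA device plugin


def container_resources(cpu: str, memory: str, gpus: int) -> Dict[str, Dict[str, str]]:
    """Build a `resources` stanza; GPU request and limit are kept equal."""
    return {
        "requests": {"cpu": cpu, "memory": memory, GPU_RESOURCE: str(gpus)},
        # Assumption: memory is capped at the requested amount and CPU is left
        # without a limit. That is a policy choice, not a rule.
        "limits": {"memory": memory, GPU_RESOURCE: str(gpus)},
    }


def undeclared_resources(
    resources: Dict[str, Dict[str, str]],
    required: Sequence[str] = ("cpu", "memory", GPU_RESOURCE),
) -> List[str]:
    """Return required resource names that appear in neither requests nor limits."""
    declared = set(resources.get("requests", {})) | set(resources.get("limits", {}))
    return [name for name in required if name not in declared]
```

A spec that only sets a GPU limit would come back with `cpu` and `memory` flagged. Shared memory and storage throughput are not expressed through this stanza, so they need separate review.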
Only setting a GPU request and ignoring CPU, RAM, shared memory and storage throughput.
Source / Inspiration: Kubernetes module quiz inspiration · NVIDIA Kubernetes device plugin
Use a Job for finite batch work such as preprocessing, evaluation or a training run. Use a Deployment for continuously running services such as inference APIs.
The difference is lifecycle semantics. A Job wants completion and retry policy; a Deployment wants desired replica count, rolling update behavior and service availability. Training can also need custom controllers or workflow systems when checkpointing and distributed launch are involved.
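The lifecycle difference shows up directly in the manifest fields. The sketch below writes both workload kinds as Python dicts so they can be compared side by side; the function names, container names and default values are hypothetical, while the field names follow the `batch/v1` Job and `apps/v1` Deployment APIs.

```python
from typing import Any, Dict


def training_job(name: str, image: str, retries: int = 2) -> Dict[str, Any]:
    """A finite task: runs to completion, retried up to `retries` times."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": retries,  # retry policy
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # failed pods are replaced by the Job controller
                    "containers": [{"name": "train", "image": image}],
                }
            },
        },
    }


def inference_deployment(name: str, image: str, replicas: int = 3) -> Dict[str, Any]:
    """A long-running service: desired replica count plus rolling updates."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "serve", "image": image}]},
            },
        },
    }
```

The same image can appear in both; what changes is whether the controller is asked for completion or for availability.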
Running a one-off batch task as a long-lived Deployment just because it uses the same container image.
Source / Inspiration: Kubernetes module quiz inspiration · Kubernetes workload docs
Use standard service signals such as latency, errors, saturation, logs and traces, plus ML-specific signals such as input distributions, prediction distributions, model version, feature freshness and label-based quality when available.
Operational metrics tell whether the service is healthy; ML metrics tell whether the model behavior is changing. Good systems attach model version, dataset version and feature pipeline metadata so regressions can be traced back to a rollout or data shift.
Only monitoring GPU utilization and HTTP status. A model can be fast and available while silently drifting.
Source / Inspiration: Observability module quiz inspiration · Prometheus docs · OpenTelemetry docs
Drift monitoring is production-dependent. Common checks include input data quality, feature/input distribution shift, prediction distribution shift and downstream performance when labels or delayed feedback are available. Concept drift means the relationship between inputs and target changes, so distribution checks are an early warning, not final proof.
The right monitoring design depends on task, data pipeline, label latency and failure cost. Practical drift monitoring often combines statistical summaries, cohort analysis, business metrics and model-quality feedback. The action path matters: alert, inspect examples, retrain or roll back.
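One common statistical summary for input or prediction shift is the Population Stability Index (PSI), used here as an illustrative choice rather than something the curriculum prescribes. The sketch compares binned fractions of a reference sample and a current sample; the bin edges and the epsilon floor are assumptions the caller would tune.

```python
import math
from typing import List, Sequence


def histogram_fractions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Fraction of values per bin; `edges` are ascending interior cut points."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
        counts[idx] += 1
    # Floor empty bins at eps so the logarithm below stays defined.
    return [max(c / len(values), eps) for c in counts]


def psi(
    reference: Sequence[float], current: Sequence[float], edges: Sequence[float]
) -> float:
    """Population Stability Index: sum over bins of (cur - ref) * ln(cur / ref)."""
    ref = histogram_fractions(reference, edges)
    cur = histogram_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A PSI of zero means identical binned distributions, and larger values mean more shift. Any alert threshold should be calibrated per feature, since expected seasonality also moves this number.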
Equating any distribution change with model failure. Some shifts are expected seasonality or product changes.
Source / Inspiration: Observability module quiz inspiration · MLOps module quiz inspiration
A model registry tracks model artifacts, versions, metadata, approval state and deployment stage. It creates a controlled handoff between experimentation and production.
The registry should connect a model to training code, data, metrics, environment and owner. In production, rollback and auditability depend on knowing exactly which artifact is serving. It is not just a folder of checkpoints.
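A minimal sketch of what a registry entry might carry, assuming a simple four-stage lifecycle. The stage names, the `ModelVersion` class and the allowed transitions are hypothetical; real registries define their own.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Assumed lifecycle: which stage changes are permitted from each stage.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "candidate": {"staging", "archived"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str         # where the weights live
    code_commit: str          # training code revision
    data_version: str         # dataset snapshot identifier
    metrics: Dict[str, float]
    owner: str
    stage: str = "candidate"

    def promote(self, new_stage: str) -> None:
        """Move to a new stage, rejecting transitions the lifecycle forbids."""
        if new_stage not in ALLOWED_TRANSITIONS[self.stage]:
            raise ValueError(f"cannot move {self.stage} -> {new_stage}")
        self.stage = new_stage
```

The weights are only one field here; the lineage fields and the guarded stage transition are what make rollback and audit possible.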
Thinking the registry stores only weights. Metadata and lifecycle state are the operational value.
Source / Inspiration: MLOps module quiz inspiration · External curriculum
Experiment tracking records parameters, code version, data version, metrics and artifacts so decisions can be explained and repeated. It prevents teams from relying on memory or screenshots.
For infrastructure interviews, emphasize that reproducibility is a system property. A result is useful only if the team can identify what changed and recreate the environment. This also supports model reviews and incident response.
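To make "identify what changed" concrete, a run record can carry a stable hash of its configuration next to the code and data versions. This is a small sketch with hypothetical field names, not the schema of any particular tracking tool.

```python
import hashlib
import json
from typing import Any, Dict


def run_record(
    params: Dict[str, Any],
    code_commit: str,
    data_version: str,
    metrics: Dict[str, float],
) -> Dict[str, Any]:
    """Bundle what is needed to explain and repeat a run."""
    # sort_keys makes the hash independent of dict insertion order.
    config_blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "config_hash": hashlib.sha256(config_blob).hexdigest()[:12],
        "params": params,
        "code_commit": code_commit,
        "data_version": data_version,
        "metrics": metrics,
    }
```

Two runs with the same `config_hash`, `code_commit` and `data_version` but different metrics point at nondeterminism or the environment rather than at a config change.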
Logging only final accuracy. Training curves, config, data split and hardware context often explain why a run behaved differently.
Source / Inspiration: MLOps module quiz inspiration · Curriculum getting started
Improve utilization before buying more capacity: batch work, remove input stalls, use mixed precision where valid, checkpoint long jobs, right-size instances and reserve expensive GPUs for GPU-bound stages.
Cost optimization is a tradeoff exercise. Spot/preemptible instances can be effective for checkpointed training, while serving needs enough headroom for latency SLOs. Quantization and batching can reduce serving cost, but they must be validated against quality and latency targets.
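The spot-versus-on-demand tradeoff can be roughed out with simple arithmetic. The model below assumes each preemption loses, on average, half a checkpoint interval plus a fixed restart time; the function names and that loss model are illustrative assumptions, and any prices plugged in are the caller's.

```python
def on_demand_cost(base_hours: float, price_per_hour: float) -> float:
    """Cost of an uninterrupted run."""
    return base_hours * price_per_hour


def spot_cost(
    base_hours: float,
    price_per_hour: float,
    expected_preemptions: float,
    checkpoint_interval_hours: float,
    restart_hours: float,
) -> float:
    """Cost of a checkpointed run on preemptible capacity.

    Assumption: each preemption loses half a checkpoint interval of work on
    average, plus the time to reschedule and reload the checkpoint.
    """
    lost_per_preemption = checkpoint_interval_hours / 2 + restart_hours
    effective_hours = base_hours + expected_preemptions * lost_per_preemption
    return effective_hours * price_per_hour
```

With made-up numbers (100 base hours, 4.0 per hour on demand, 1.5 per hour spot, 10 preemptions, hourly checkpoints, 15-minute restarts), the spot run needs 107.5 hours and costs 161.25 against 400.0 on demand. Shorter checkpoint intervals shrink the lost work but add checkpoint overhead, which this sketch does not model.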
Using spot capacity for a stateful service without graceful draining or fallback.
Source / Inspiration: GPU module quiz inspiration · LLM module quiz inspiration
An SLO defines the target behavior users should receive. Alerts should fire when burn rate or symptoms indicate the SLO is at risk, not for every noisy internal metric.
For ML services, combine service SLOs such as latency and error rate with model-specific guardrails. A good alert is actionable: it points to a likely owner and response path. Too many low-signal alerts train teams to ignore them.
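Burn rate has a simple definition: the observed error ratio divided by the error budget the SLO allows. The sketch below computes it and applies a two-window check, a commonly described pattern for cutting noise; the function names are hypothetical and the paging threshold is a parameter the team would choose.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window_burn: float, short_window_burn: float, threshold: float) -> bool:
    """Page only if the burn is both sustained (long window) and still happening (short window)."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

For a 99.9% SLO, 10 failures in 1,000 requests is a burn rate of about 10: the budget is being consumed ten times faster than the target allows.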
Alerting directly on every metric threshold. Thresholds without user impact or actionability create alert fatigue.
Source / Inspiration: Observability module quiz inspiration · Prometheus docs
A feature store helps define, compute and serve features consistently for training and online inference. It reduces training-serving skew and centralizes feature ownership.
The core mechanism is not only storage. It is a contract around feature definitions, freshness, point-in-time correctness and online/offline consistency. For real-time models, latency and freshness requirements determine architecture.
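Point-in-time correctness means a training example may only see feature values that were known at its event time. A minimal sketch using a sorted history and binary search; `point_in_time_value` is a hypothetical helper, not a feature store API.

```python
import bisect
from typing import List, Optional, Tuple


def point_in_time_value(
    history: List[Tuple[float, float]], as_of: float
) -> Optional[float]:
    """Return the latest feature value with timestamp <= as_of.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value was known yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_right(timestamps, as_of)  # first position strictly after as_of
    if idx == 0:
        return None
    return history[idx - 1][1]
```

Joining on the latest value regardless of `as_of` would leak future information into training, which is one source of training-serving skew.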
Treating a feature store as just a database table. Time semantics and consistency are the hard part.
Source / Inspiration: MLOps module quiz inspiration · External curriculum
Model behavior depends on code, config and data. Data versioning lets a team reproduce a run, audit a deployment and understand whether a regression came from code or data.
Unlike ordinary software, changing examples can change the product. Versioning should cover raw data snapshots, derived features and labeling rules when possible. It is also useful for rollback and compliance.
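One lightweight way to version a snapshot is a content fingerprint over every file in it, labeling rules included. The sketch below hashes an in-memory mapping of path to bytes; real tools stream files from disk or object storage, and `dataset_fingerprint` is a hypothetical name.

```python
import hashlib
from typing import Dict


def dataset_fingerprint(files: Dict[str, bytes]) -> str:
    """Stable hash over file paths and contents, independent of iteration order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        file_hash = hashlib.sha256(files[path]).hexdigest()
        # Include the path so that renaming or moving a file changes the fingerprint.
        digest.update(f"{path}\0{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()
```

Recording this fingerprint next to the code commit in each run lets a team tell whether a regression came with a data change or a code change.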
Assuming source control for code is enough to reproduce ML behavior.
Source / Inspiration: MLOps module quiz inspiration · External curriculum
Start with offline validation, then use shadow traffic, canary rollout or A/B testing depending on risk. Monitor service and model metrics, keep rollback simple and tie every deployment to a versioned artifact.
| Method | Exposure | Question answered | Constraint |
|---|---|---|---|
| Shadow traffic | Mirrored requests; output is not user-visible. | Does the candidate behave operationally on real traffic? | Does not measure live user outcome. |
| Canary rollout | Small live user slice. | Is impact acceptable with limited blast radius? | Needs rollback thresholds and monitoring. |
| A/B testing | Experiment cohorts receive competing variants. | Which model improves defined product outcomes? | Needs experimental design and enough traffic. |
Safe deployment is about reducing blast radius. Shadow traffic validates runtime behavior without user impact. Canary validates a small slice of real traffic. A/B tests compare product outcomes but need experimental design.
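A canary needs two pieces: a stable way to pick the small slice, and a rule for when to stop. The sketch below uses hash-based bucketing and an absolute error-rate guardrail. The function names, the 0.01% bucket granularity and the default thresholds are assumptions for illustration.

```python
import hashlib


def in_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically assign a key (for example a user ID) to the canary slice."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # buckets 0..9999, i.e. 0.01% granularity
    return bucket < canary_percent * 100


def canary_decision(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_abs_increase: float = 0.01,
    min_canary_requests: int = 500,
) -> str:
    """Return "wait", "rollback" or "continue" from error counts."""
    if canary_total < min_canary_requests or baseline_total == 0:
        return "wait"  # not enough evidence yet
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate > max_abs_increase:
        return "rollback"
    return "continue"
```

Hashing the key keeps a given user on the same variant across requests. A production guardrail would usually also cover latency and model-specific metrics, not error rate alone.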
Shipping a new model because offline metrics improved. Offline gains can fail under real traffic, latency or distribution shift.
Source / Inspiration: MLOps module quiz inspiration · Observability module quiz inspiration
Object storage commonly holds datasets, checkpoints, model artifacts, logs and intermediate outputs. It decouples durable artifacts from ephemeral compute.
GPU nodes should be treated as replaceable. Durable storage makes preemption, retries and multi-stage pipelines practical. The tradeoff is that object storage access patterns and throughput must be designed carefully for training workloads.
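One common way to make object storage access friendlier to training is to pack many small samples into larger shards that are read sequentially. The sketch below uses tar archives from the standard library and works entirely in memory; the function names are hypothetical and the upload or download step is left out.

```python
import io
import tarfile
from typing import Dict, Iterator, Tuple


def pack_shard(samples: Dict[str, bytes]) -> bytes:
    """Pack many small samples into one tar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in sorted(samples):
            info = tarfile.TarInfo(name=name)
            info.size = len(samples[name])
            tar.addfile(info, io.BytesIO(samples[name]))
    return buffer.getvalue()


def iter_shard(blob: bytes) -> Iterator[Tuple[str, bytes]]:
    """Read samples back sequentially from a packed shard."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar:
            if member.isfile():
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()
```

One large sequential read per shard replaces one request per sample, which usually matters more for throughput than the archive format itself.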
Reading many tiny files directly from remote object storage during training without caching or packing.
Source / Inspiration: External curriculum · Curriculum getting started
Infrastructure as code makes clusters, networks, storage and permissions reviewable, repeatable and auditable. It reduces configuration drift between environments.
In ML platforms, IaC helps reproduce GPU pools, serving environments and data access controls. It also gives teams a change history for incidents. The operational risk is managing state and secrets correctly.
Using IaC as a dumping ground for manual fixes without review or environment boundaries.
Source / Inspiration: External curriculum · AI infra curriculum README
Classify the memory: CPU RAM, GPU weights, activations, optimizer state, KV cache, allocator fragmentation or batch/input growth. Then reduce the specific pressure with batch limits, quantization, checkpointing, sharding, cache policy or resource limits.
OOM debugging requires matching the workload phase. Training OOMs often involve activations and optimizer states. LLM serving OOMs often involve KV cache and variable request lengths. Container limits can also kill a process before GPU memory is exhausted.
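For the training case, a back-of-the-envelope estimate of persistent state helps separate "the model simply does not fit" from "activations or a leak pushed it over". The defaults below assume mixed-precision training with Adam: 2 bytes each for half-precision weights and gradients, and 12 bytes of FP32 optimizer state per parameter (master weights, momentum, variance). Activations, buffers and fragmentation are deliberately excluded.

```python
from typing import Dict


def training_state_bytes(
    num_params: int,
    weight_bytes: int = 2,      # half-precision weights
    grad_bytes: int = 2,        # half-precision gradients
    optimizer_bytes: int = 12,  # FP32 master weights + Adam momentum + variance
) -> Dict[str, int]:
    """Persistent per-parameter memory for training, excluding activations."""
    parts = {
        "weights": num_params * weight_bytes,
        "gradients": num_params * grad_bytes,
        "optimizer_state": num_params * optimizer_bytes,
    }
    parts["total"] = sum(parts.values())
    return parts
```

For a 7-billion-parameter model this gives 112 GB of state before a single activation is stored, which is why sharding optimizer state is often the first lever rather than a smaller batch.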
Only lowering batch size. It may work, but it hides whether the real cause is a leak, longer prompts or a deployment limit.
Source / Inspiration: GPU module quiz inspiration · LLM module quiz inspiration
LLM serving has token-by-token generation, large model weights, KV cache growth, variable prompt/output lengths and strong latency-throughput tradeoffs. Request latency alone is not enough: time to first token (TTFT) and tokens per second also matter.
The system has two phases: prefill processes the whole prompt, and decode generates output tokens autoregressively, typically one token per sequence per step. Continuous batching, cache management, quantization and scheduling become first-class infrastructure concerns. RAG adds retrieval latency and data freshness to the path.
| Phase | Work | Operational concern |
|---|---|---|
| Prefill | Processes prompt tokens and constructs KV cache state. | TTFT and long-prompt admission. |
| Decode | Produces output tokens while reading and growing KV cache. | Time per output token (TPOT), cache capacity, and batching fairness. |
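KV cache growth can be estimated directly: each token stores one key and one value vector per layer per KV head. The sketch below applies that formula; the model shape used in the note afterwards is a hypothetical configuration, not a claim about any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16 or BF16
) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, per head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size
```

With 32 layers, 32 KV heads, a head dimension of 128 and 2-byte elements, each token costs 512 KiB, so a single 4,096-token sequence holds 2 GiB of cache. Doubling concurrency or context length doubles that figure, which is why admission control and cache capacity sit next to each other in the table above.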
Treating an LLM endpoint like a fixed-shape classifier. Dynamic sequence length changes memory, scheduling and latency behavior.
Source / Inspiration: LLM module quiz inspiration · vLLM docs · Hugging Face KV cache docs
Before an interview, you should be able to answer these without reading the page.
Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.