AI Infra Interview Practice

Practice explaining the systems covered on this page.

Part of InfraLens Interview Practice

Use questions to train explanation, not memorization

This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.

Reading method

Try to answer each question out loud first. Then open the answer and check whether you covered the mechanism, why it matters, the tradeoffs, common mistakes, and the related handbook or lab.

Q&A Cards

Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.

01 What does AI infrastructure optimize beyond model accuracy?

Short Answer

AI infrastructure turns model code into a repeatable production system. It optimizes throughput, latency, reliability, cost, reproducibility and operational visibility, not only loss curves.

Deeper Explanation

A strong answer separates model quality from the system that trains and serves it. The system has to schedule GPUs, move data, keep artifacts reproducible, expose metrics and recover from failures. Accuracy matters, but production teams also need predictable behavior under load and clear ownership when something breaks.

Common Mistake

Only talking about frameworks such as PyTorch. Frameworks are part of the stack, but infra includes orchestration, storage, networking, deployment and monitoring.

Source / Inspiration: External curriculum coverage · AI infra curriculum README

02 How would you debug a slow ML training job?

Short Answer

Start with a per-step timeline: data loading, host-to-device transfer, forward and backward GPU kernels, and distributed communication. Then compare GPU utilization, step time variance, dataloader wait time, memory pressure and collective duration.

Deeper Explanation

The goal is to localize the bottleneck before changing code. If the GPU is idle, look at the input pipeline and CPU preprocessing. If kernels dominate, profile compute and memory access. If collectives dominate, inspect gradient bucket sizes, network topology and how well communication overlaps with compute. In multi-node training, network and storage are often as important as model code.
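The localization step can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: the phase names (`data_wait`, `h2d_copy`, `compute`, `collectives`) and the timings are hypothetical, and in practice the numbers would come from a tool such as Nsight Systems or from timers placed around each phase.

```python
def phase_shares(steps):
    """Return each phase's share of the total measured time across all steps."""
    totals = {}
    for step in steps:
        for phase, seconds in step.items():
            totals[phase] = totals.get(phase, 0.0) + seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {phase: 0.0 for phase in totals}
    return {phase: seconds / grand_total for phase, seconds in totals.items()}


def dominant_phase(steps):
    """Name the phase that consumes the largest share of step time."""
    shares = phase_shares(steps)
    return max(shares, key=shares.get)


# Hypothetical per-step timings in seconds.
steps = [
    {"data_wait": 0.31, "h2d_copy": 0.02, "compute": 0.12, "collectives": 0.05},
    {"data_wait": 0.28, "h2d_copy": 0.02, "compute": 0.12, "collectives": 0.06},
]
print(dominant_phase(steps))  # data_wait: the GPU is mostly waiting on input
```

Here the input pipeline dominates, so the next step would be the dataloader and CPU preprocessing rather than the model code.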

Common Mistake

Jumping straight to a larger GPU. More hardware can hide the problem temporarily while increasing cost.

Source / Inspiration: GPU module quiz inspiration · NVIDIA Nsight Systems

03 What is the difference between ML infrastructure and MLOps?

Short Answer

ML infrastructure is the platform substrate: compute, storage, networking, serving and observability. MLOps is the operating discipline that versions data/models, automates training and deployment, tracks experiments and monitors model behavior.

Deeper Explanation

They overlap in practice. A platform may provide registries, CI/CD, feature stores and monitoring, while MLOps defines how teams use those tools to ship safely. In interviews, frame infrastructure as capabilities and MLOps as lifecycle control.

Common Mistake

Treating MLOps as only CI/CD. ML systems also need data lineage, model validation, drift monitoring and rollback paths.

Source / Inspiration: MLOps module quiz inspiration · External curriculum

04 Why do ML teams containerize training and serving workloads?

Short Answer

Containers make dependencies, CUDA libraries, application code and runtime entrypoints reproducible across machines. They also give schedulers a standard unit to place, restart and resource-limit.

Deeper Explanation

For GPU workloads, containerization is not enough by itself. The host driver, container runtime and GPU device plugin still have to expose devices correctly. A good answer mentions reproducibility and the boundary between image contents and host-level GPU support.

Common Mistake

Assuming the container image pins the GPU driver. The driver lives on the host, so the host driver version remains part of the compatibility story alongside the CUDA libraries in the image.

Source / Inspiration: Kubernetes module quiz inspiration · NVIDIA Kubernetes device plugin

05 When is Kubernetes useful for ML, and when can it add friction?

Short Answer

Kubernetes helps when teams need scheduling, service discovery, rollouts, autoscaling and standardized deployment across many workloads. It adds friction when GPU topology, storage locality, long-running training jobs or debugging needs are poorly modeled.

Deeper Explanation

Kubernetes is a control plane, not a magic performance layer. It can manage Deployments, Jobs and resource requests, but ML workloads still need correct images, device plugins, checkpointing and careful network/storage design.

Common Mistake

Saying Kubernetes automatically makes ML scalable. It provides orchestration primitives; the workload must still be designed for them.

Source / Inspiration: Kubernetes module quiz inspiration · Kubernetes docs

06 How do resource requests and limits affect ML workloads in Kubernetes?

Short Answer

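Requests drive scheduling; limits cap what a container can use at runtime. GPUs are extended resources exposed by the device plugin, requested as whole or partitioned devices and declared under limits (a GPU request, if given, must equal the limit). CPU and memory limits still apply: exceeding a CPU limit throttles the process, and exceeding a memory limit gets it OOM-killed.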

Deeper Explanation

This matters because a model may fail from host memory pressure even when GPU memory looks fine. Requests should reflect what the scheduler must reserve, and limits should avoid accidental noisy-neighbor behavior. For GPUs, placement and topology often matter more than just count.
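As a rough illustration, the check below walks a container's `resources` block, written here as a Python dict that mirrors the Kubernetes pod spec fields. `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin; the helper `resource_gaps`, the container name and the sizes are hypothetical and only sketch the kind of review a platform team might automate.

```python
def resource_gaps(container, gpu_key="nvidia.com/gpu"):
    """List resource settings that are missing or inconsistent for a GPU container."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    gaps = []
    # CPU and memory requests tell the scheduler what to reserve on the node.
    for name in ("cpu", "memory"):
        if name not in requests:
            gaps.append(f"requests.{name}")
    # Without a memory limit the container can crowd out its neighbors.
    if "memory" not in limits:
        gaps.append("limits.memory")
    # GPUs are extended resources: declared under limits; a request must match.
    if gpu_key not in limits:
        gaps.append(f"limits.{gpu_key}")
    elif gpu_key in requests and requests[gpu_key] != limits[gpu_key]:
        gaps.append(f"requests.{gpu_key} != limits.{gpu_key}")
    return gaps


# Hypothetical trainer container with CPU, memory and one GPU declared.
trainer = {
    "name": "trainer",
    "resources": {
        "requests": {"cpu": "8", "memory": "64Gi"},
        "limits": {"memory": "64Gi", "nvidia.com/gpu": 1},
    },
}
gpu_only = {"name": "trainer", "resources": {"limits": {"nvidia.com/gpu": 1}}}

print(resource_gaps(trainer))
print(resource_gaps(gpu_only))
```

The second container is the pattern from the common mistake below: a GPU is declared, but nothing tells the scheduler how much CPU and host memory the job needs.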

Common Mistake

Only setting a GPU request and ignoring CPU, RAM, shared memory and storage throughput.

Source / Inspiration: Kubernetes module quiz inspiration · NVIDIA Kubernetes device plugin

07 When should you use a Kubernetes Job instead of a Deployment?

Short Answer

Use a Job for finite batch work such as preprocessing, evaluation or a training run. Use a Deployment for continuously running services such as inference APIs.

Deeper Explanation

The difference is lifecycle semantics. A Job runs to completion under a retry policy; a Deployment maintains a desired replica count, rolling update behavior and service availability. Training can also need custom controllers or workflow systems when checkpointing and distributed launch are involved.

Common Mistake

Running a one-off batch task as a long-lived Deployment just because it uses the same container image.

Source / Inspiration: Kubernetes module quiz inspiration · Kubernetes workload docs

08 What signals belong in ML observability?

Short Answer

Use standard service signals such as latency, errors, saturation, logs and traces, plus ML-specific signals such as input distributions, prediction distributions, model version, feature freshness and label-based quality when available.

Deeper Explanation

Operational metrics tell whether the service is healthy; ML metrics tell whether the model behavior is changing. Good systems attach model version, dataset version and feature pipeline metadata so regressions can be traced back to a rollout or data shift.

Common Mistake

Only monitoring GPU utilization and HTTP status. A model can be fast and available while silently drifting.

Source / Inspiration: Observability module quiz inspiration · Prometheus docs · OpenTelemetry docs

09 How do you monitor data or concept drift?

Short Answer

Drift monitoring is production-dependent. Common checks include input data quality, feature/input distribution shift, prediction distribution shift and downstream performance when labels or delayed feedback are available. Concept drift means the relationship between inputs and target changes, so distribution checks are an early warning, not final proof.

Deeper Explanation

The right monitoring design depends on task, data pipeline, label latency and failure cost. Practical drift monitoring often combines statistical summaries, cohort analysis, business metrics and model-quality feedback. The action path matters: alert, inspect examples, retrain or roll back.
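One statistical summary that is often used for input distribution shift is the Population Stability Index (PSI). The sketch below is a minimal pure-Python version, assuming non-empty samples and hand-picked bin edges; the function name, the edges and the example data are illustrative, and any alert threshold would have to be chosen per feature.

```python
import math


def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            # Bin index is the number of edges at or below the value.
            index = sum(1 for edge in edges if value >= edge)
            counts[index] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values: a training baseline and a shifted production sample.
baseline = [i / 100 for i in range(100)]
shifted = [value + 0.3 for value in baseline]
edges = [0.25, 0.5, 0.75]

print(psi(baseline, baseline, edges))  # 0.0 for identical samples
print(psi(baseline, shifted, edges))   # positive when the distribution moves
```

A non-zero PSI only says that the inputs moved. As the short answer notes, it is an early warning that should trigger inspection, not proof that the model has degraded.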

Common Mistake

Equating any distribution change with model failure. Some shifts are expected seasonality or product changes.

Source / Inspiration: Observability module quiz inspiration · MLOps module quiz inspiration

10 What is a model registry for?

Short Answer

A model registry tracks model artifacts, versions, metadata, approval state and deployment stage. It creates a controlled handoff between experimentation and production.

Deeper Explanation

The registry should connect a model to training code, data, metrics, environment and owner. In production, rollback and auditability depend on knowing exactly which artifact is serving. It is not just a folder of checkpoints.

Common Mistake

Thinking the registry stores only weights. Metadata and lifecycle state are the operational value.

Source / Inspiration: MLOps module quiz inspiration · External curriculum

11 Why is experiment tracking more than saving the best checkpoint?

Short Answer

Experiment tracking records parameters, code version, data version, metrics and artifacts so decisions can be explained and repeated. It prevents teams from relying on memory or screenshots.

Deeper Explanation

For infrastructure interviews, emphasize that reproducibility is a system property. A result is useful only if the team can identify what changed and recreate the environment. This also supports model reviews and incident response.

Common Mistake

Logging only final accuracy. Training curves, config, data split and hardware context often explain why a run behaved differently.

Source / Inspiration: MLOps module quiz inspiration · Curriculum getting started

12 How would you reduce GPU cost without hurting reliability?

Short Answer

Improve utilization before buying more capacity: batch work, remove input stalls, use mixed precision where valid, checkpoint long jobs, right-size instances and reserve expensive GPUs for GPU-bound stages.

Deeper Explanation

Cost optimization is a tradeoff exercise. Spot/preemptible instances can be effective for checkpointed training, while serving needs enough headroom for latency SLOs. Quantization and batching can reduce serving cost, but they must be validated against quality and latency targets.

Common Mistake

Using spot capacity for a stateful service without graceful draining or fallback.

Source / Inspiration: GPU module quiz inspiration · LLM module quiz inspiration

13 How do SLOs and alerts relate?

Short Answer

An SLO defines the target behavior users should receive. Alerts should fire when the error-budget burn rate or user-facing symptoms indicate the SLO is at risk, not for every noisy internal metric.

Deeper Explanation

For ML services, combine service SLOs such as latency and error rate with model-specific guardrails. A good alert is actionable: it points to a likely owner and response path. Too many low-signal alerts train teams to ignore them.
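Burn rate can be made concrete with a little arithmetic: it is the observed error ratio divided by the error budget that the SLO allows. The sketch below assumes a simple availability SLO and a two-window paging rule; the function names, window contents and the threshold value are illustrative choices, not a prescribed policy.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window, short_window, slo_target, threshold):
    """Page only if both the long and the short window burn faster than threshold."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


# Hypothetical counts of (bad requests, total requests) per window for a 99.9% SLO.
long_window = (50, 10_000)   # burn rate of about 5
short_window = (10, 1_000)   # burn rate of about 10

print(should_page(long_window, short_window, slo_target=0.999, threshold=4.0))
```

Requiring both windows keeps the alert tied to sustained budget consumption: a burn rate of 1 spends the budget exactly over the SLO period, and a short spike that has already recovered does not page.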

Common Mistake

Alerting directly on every metric threshold. Thresholds without user impact or actionability create alert fatigue.

Source / Inspiration: Observability module quiz inspiration · Prometheus docs

14 What problem does a feature store solve?

Short Answer

A feature store helps define, compute and serve features consistently for training and online inference. It reduces training-serving skew and centralizes feature ownership.

Deeper Explanation

The core mechanism is not only storage. It is a contract around feature definitions, freshness, point-in-time correctness and online/offline consistency. For real-time models, latency and freshness requirements determine architecture.
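Point-in-time correctness is easier to explain with a small lookup. The sketch below assumes each feature has a history of `(timestamp, value)` pairs sorted by timestamp and returns the value that was known at the moment of a training event; `feature_as_of` and the sample history are hypothetical and stand in for what a feature store does at scale.

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, which avoids leaking future data.
    """
    timestamps = [timestamp for timestamp, _ in history]
    index = bisect_right(timestamps, event_time)
    if index == 0:
        return None
    return history[index - 1][1]


# Hypothetical history of a "purchases_last_7d" feature for one user.
history = [(10, 1.0), (20, 2.0), (30, 3.0)]

print(feature_as_of(history, 25))  # 2.0: the value written at t=30 is not visible yet
print(feature_as_of(history, 5))   # None: the feature did not exist at t=5
```

Joining a label at time 25 against the latest value (3.0) instead would train on information that was not available at prediction time. Whether a value written exactly at the event timestamp counts as visible is a design choice; this sketch treats it as visible.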

Common Mistake

Treating a feature store as just a database table. Time semantics and consistency are the hard part.

Source / Inspiration: MLOps module quiz inspiration · External curriculum

15 Why does data versioning matter for ML systems?

Short Answer

Model behavior depends on code, config and data. Data versioning lets a team reproduce a run, audit a deployment and understand whether a regression came from code or data.

Deeper Explanation

Unlike ordinary software, changing examples can change the product. Versioning should cover raw data snapshots, derived features and labeling rules when possible. It is also useful for rollback and compliance.
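A simple way to illustrate the idea is a content-addressed manifest: hash every file, then hash the sorted manifest to get one version identifier for the snapshot. The sketch below works on in-memory bytes to stay self-contained; `dataset_version` and the file names are hypothetical, and dedicated data versioning tools add storage, lineage and access control on top.

```python
import hashlib
import json


def dataset_version(files):
    """Return (version_id, manifest) for a mapping of relative path -> bytes."""
    manifest = {
        path: hashlib.sha256(content).hexdigest()
        for path, content in files.items()
    }
    # Sorting keys makes the version independent of file listing order.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest(), manifest


# Hypothetical snapshot: the same files listed in a different order.
snapshot_a = {"train.csv": b"id,label\n1,0\n", "labels.json": b'{"0": "cat"}'}
snapshot_b = {"labels.json": b'{"0": "cat"}', "train.csv": b"id,label\n1,0\n"}
# A relabeled example changes the version even though no code changed.
snapshot_c = {"train.csv": b"id,label\n1,1\n", "labels.json": b'{"0": "cat"}'}

version_a, _ = dataset_version(snapshot_a)
version_b, _ = dataset_version(snapshot_b)
version_c, manifest_c = dataset_version(snapshot_c)
print(version_a == version_b, version_a == version_c)
```

Recording the version identifier next to the code commit and config of a training run is what lets a team say whether a regression came from code or from data.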

Common Mistake

Assuming source control for code is enough to reproduce ML behavior.

Source / Inspiration: MLOps module quiz inspiration · External curriculum

16 How would you deploy a model safely?

Short Answer

Start with offline validation, then use shadow traffic, canary rollout or A/B testing depending on risk. Monitor service and model metrics, keep rollback simple and tie every deployment to a versioned artifact.

| Method | Exposure | Question answered | Constraint |
| --- | --- | --- | --- |
| Shadow traffic | Mirrored requests; output is not user-visible. | Does the candidate behave operationally on real traffic? | Does not measure live user outcome. |
| Canary rollout | Small live user slice. | Is impact acceptable with limited blast radius? | Needs rollback thresholds and monitoring. |
| A/B testing | Experiment cohorts receive competing variants. | Which model improves defined product outcomes? | Needs experimental design and enough traffic. |

Deeper Explanation

Safe deployment is about reducing blast radius. Shadow traffic validates runtime behavior without user impact. Canary validates a small slice of real traffic. A/B tests compare product outcomes but need experimental design.
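A canary slice is usually assigned deterministically so that the same user keeps seeing the same model version. The sketch below hashes a request key into one of 10,000 buckets and routes the lowest buckets to the canary; `assign_variant`, the salt string and the user ids are hypothetical, and real rollouts typically do this in the serving gateway or service mesh.

```python
import hashlib


def assign_variant(request_key, canary_percent, salt="model-rollout"):
    """Deterministically route a key to 'canary' or 'stable'.

    canary_percent is the share of traffic (0 to 100) sent to the canary.
    The salt is an arbitrary string; changing it reshuffles the assignment.
    """
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"


# Hypothetical user ids routed with a 5% canary.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_variant(u, 5) == "canary" for u in users) / len(users)
print(round(canary_share, 3))
```

Raising `canary_percent` only adds users to the canary, since existing canary buckets stay below the new cutoff, and setting it to 0 is the rollback path.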

Common Mistake

Shipping a new model because offline metrics improved. Offline gains can fail under real traffic, latency or distribution shift.

Source / Inspiration: MLOps module quiz inspiration · Observability module quiz inspiration

17 What role does object storage play in ML infrastructure?

Short Answer

Object storage commonly holds datasets, checkpoints, model artifacts, logs and intermediate outputs. It decouples durable artifacts from ephemeral compute.

Deeper Explanation

GPU nodes should be treated as replaceable. Durable storage makes preemption, retries and multi-stage pipelines practical. The tradeoff is that object storage access patterns and throughput must be designed carefully for training workloads.
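One common way to shape access patterns is to pack many small samples into larger shard files, so training reads a few large sequential objects instead of issuing one request per sample. The sketch below builds tar shards in memory with the standard library; `pack_shards`, the sample names and the shard size are illustrative, and the upload to object storage is left out.

```python
import io
import tarfile


def write_tar(members):
    """Serialize a list of (name, bytes) pairs into one uncompressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def pack_shards(samples, max_shard_bytes):
    """Group (name, bytes) samples into tar shards of roughly max_shard_bytes payload."""
    shards, current, current_size = [], [], 0
    for name, data in samples:
        # Start a new shard when adding this sample would exceed the budget.
        if current and current_size + len(data) > max_shard_bytes:
            shards.append(write_tar(current))
            current, current_size = [], 0
        current.append((name, data))
        current_size += len(data)
    if current:
        shards.append(write_tar(current))
    return shards


# Hypothetical dataset: ten 100-byte samples packed into 400-byte shards.
samples = [(f"sample-{i}.bin", bytes([i]) * 100) for i in range(10)]
shards = pack_shards(samples, max_shard_bytes=400)
print(len(shards))  # 3 shards holding 4, 4 and 2 samples
```

Each shard can then be stored as a single object and streamed or cached on local disk by the training job.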

Common Mistake

Reading many tiny files directly from remote object storage during training without caching or packing.

Source / Inspiration: External curriculum · Curriculum getting started

18 What does infrastructure as code add to ML platforms?

Short Answer

Infrastructure as code makes clusters, networks, storage and permissions reviewable, repeatable and auditable. It reduces configuration drift between environments.

Deeper Explanation

In ML platforms, IaC helps reproduce GPU pools, serving environments and data access controls. It also gives teams a change history for incidents. The operational risk is managing state and secrets correctly.

Common Mistake

Using IaC as a dumping ground for manual fixes without review or environment boundaries.

Source / Inspiration: External curriculum · AI infra curriculum README

19 How do you approach a production OOM incident?

Short Answer

Classify the memory: CPU RAM, GPU weights, activations, optimizer state, KV cache, allocator fragmentation or batch/input growth. Then reduce the specific pressure with batch limits, quantization, checkpointing, sharding, cache policy or resource limits.

Deeper Explanation

OOM debugging requires matching the workload phase. Training OOMs often involve activations and optimizer states. LLM serving OOMs often involve KV cache and variable request lengths. Container limits can also kill a process before GPU memory is exhausted.
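For the training case, a back-of-the-envelope estimate helps classify the memory before touching the batch size. The sketch below uses a common accounting for mixed-precision training with Adam (16 bytes per parameter for weights, gradients, fp32 master weights and two moment buffers) and deliberately excludes activations; `training_state_bytes` and the model size are hypothetical, and real frameworks differ in what they keep and where.

```python
def training_state_bytes(num_params, shards=1):
    """Estimate per-device bytes for model and optimizer state, excluding activations.

    Assumes mixed-precision Adam. shards > 1 models the simplified case where
    every state is split evenly across that many devices.
    """
    bytes_per_param = {
        "fp16_weights": 2,
        "fp16_gradients": 2,
        "fp32_master_weights": 4,
        "adam_moments": 8,  # two fp32 buffers: first and second moment
    }
    breakdown = {
        name: num_params * size // shards
        for name, size in bytes_per_param.items()
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical 7-billion-parameter model.
single = training_state_bytes(7_000_000_000)
sharded = training_state_bytes(7_000_000_000, shards=8)
print(single["total"] / 1e9, sharded["total"] / 1e9)  # 112.0 GB vs 14.0 GB
```

If this state alone exceeds device memory, lowering the batch size cannot fix the OOM, because batch size mainly affects activations; sharding or a different optimizer setup is the relevant lever.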

Common Mistake

Only lowering batch size. It may work, but it hides whether the real cause is a leak, longer prompts or a deployment limit.

Source / Inspiration: GPU module quiz inspiration · LLM module quiz inspiration

20 What makes LLM infrastructure different from ordinary model serving?

Short Answer

LLM serving has token-by-token generation, large model weights, KV cache growth, variable prompt/output lengths and strong latency-throughput tradeoffs. Time to first token (TTFT) and per-request tokens per second (the inverse of time per output token, TPOT) matter alongside end-to-end request latency.

Deeper Explanation

The system has two phases: prefill processes the whole prompt, and decode then generates output tokens step by step, typically one token per sequence per step. Continuous batching, cache management, quantization and scheduling become first-class infrastructure concerns. RAG adds retrieval latency and data freshness to the path.

| Phase | Work | Operational concern |
| --- | --- | --- |
| Prefill | Processes prompt tokens and constructs KV cache state. | TTFT and long-prompt admission. |
| Decode | Produces output tokens while reading and growing KV cache. | TPOT, cache capacity, and batching fairness. |
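The two phases can be tied to numbers with a short sketch. It estimates KV cache size from model shape (keys and values for every layer and KV head) and derives TTFT and TPOT from token arrival timestamps. The model dimensions and timestamps are hypothetical, and real serving engines add paging, quantized caches and other details that this arithmetic ignores.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for num_tokens cached tokens of one sequence.

    The leading factor 2 counts both keys and values; bytes_per_value=2
    assumes an fp16 or bf16 cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


def ttft_and_tpot(request_time, token_times):
    """Return (time to first token, mean time per output token after the first)."""
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
cache = kv_cache_bytes(num_tokens=4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(cache / 2**30)  # 2.0 GiB for one 4096-token sequence

# Hypothetical timestamps (seconds) for a request sent at t=0.0.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
print(ttft, tpot)
```

Because the cache grows with every prompt and output token, the number of concurrent sequences a GPU can hold depends on sequence length, which is why admission control and batching policy belong to the infrastructure layer.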

Common Mistake

Treating an LLM endpoint like a fixed-shape classifier. Dynamic sequence length changes memory, scheduling and latency behavior.

Source / Inspiration: LLM module quiz inspiration · vLLM docs · Hugging Face KV cache docs

Final Review Checklist

Before an interview, you should be able to answer these without reading the page.

  • What does AI infrastructure optimize beyond model accuracy?
  • How would you debug a slow ML training job?
  • What is the difference between ML infrastructure and MLOps?
  • Why do ML teams containerize training and serving workloads?
  • When is Kubernetes useful for ML, and when can it add friction?
  • How do resource requests and limits affect ML workloads in Kubernetes?
  • When should you use a Kubernetes Job instead of a Deployment?
  • What signals belong in ML observability?

Sources and Further Reading

Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.
