Foundation
2 representative questions.
Practice the system explanation inside the page.
This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.
Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.
Grouped by the kind of explanation the interview usually asks for.
Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.
GPUs run many simple operations in parallel and have high memory bandwidth, which fits dense tensor operations. They are strongest when work is batched and expressed as regular kernels.
A CPU is optimized for low-latency general-purpose control flow; a GPU is optimized for throughput. ML training and inference spend much of their time in matrix multiplies, convolutions and elementwise kernels, which map well to GPU execution.
Saying GPUs are always faster. Small batches, branching-heavy code or data stalls can leave the GPU underused.
Source / Inspiration: GPU module quiz inspiration · CUDA C Programming Guide
CUDA is NVIDIA's programming and runtime platform for GPU computing. Frameworks such as PyTorch use CUDA libraries and kernels under the hood to execute tensor operations.
Interview answers should distinguish the CUDA programming APIs, the CUDA runtime, libraries such as cuBLAS/cuDNN, and the host driver. Most ML engineers do not write every kernel, but they must understand enough to debug memory, compatibility and performance issues.
Treating CUDA as only a Python package. It spans driver/runtime/libraries and hardware execution.
Source / Inspiration: GPU module quiz inspiration · CUDA guide
Global memory is large but slow; shared memory is smaller, faster and shared by the threads within a block; registers are fastest and private to each thread. Caches sit between these layers and affect observed performance.
The hierarchy matters because many kernels are memory-bandwidth bound. Good kernels improve locality, reuse data in shared memory or registers and avoid unnecessary global reads/writes. The exact layout depends on hardware, so check current NVIDIA docs for details.
Only saying GPU memory is VRAM. Kernel performance depends on where data is accessed inside the hierarchy.
Source / Inspiration: CUDA memory hierarchy · GPU quiz inspiration
Coalescing means adjacent threads access adjacent memory so the hardware can combine requests efficiently. It improves bandwidth use for global memory loads and stores.
In tensor code, layout and stride decide whether neighboring threads touch neighboring addresses. Non-contiguous views, transposes and irregular indexing can destroy coalescing. This is why shape and stride awareness matters even in high-level frameworks.
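The effect of stride on coalescing can be sketched with a small counting model. This is an illustration only: the 128-byte segment size is a simplifying assumption (real transaction granularity depends on the GPU generation), and `transactions_per_warp` is a hypothetical helper, not a CUDA or PyTorch API.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count the distinct memory segments one warp-wide load touches.

    Assumes lane i reads element i * stride_elems and that the hardware
    serves one aligned segment of `segment_bytes` per transaction
    (a simplified model, not an exact description of any GPU).
    """
    addresses = [lane * stride_elems * elem_bytes for lane in range(warp_size)]
    segments = {addr // segment_bytes for addr in addresses}
    return len(segments)


# Adjacent threads read adjacent fp32 values: one segment serves the whole warp.
contiguous = transactions_per_warp(stride_elems=1)

# Reading a column of a row-major matrix with 1024 columns: every lane lands
# in a different segment, so the warp needs one transaction per thread.
column_read = transactions_per_warp(stride_elems=1024)

print(contiguous, column_read)
```

Under this model the same 32 values cost 1 transaction when contiguous and 32 when strided, which is the bandwidth gap that a transpose or non-contiguous view can introduce.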
Thinking coalescing changes mathematical results. It changes memory transaction efficiency.
Source / Inspiration: CUDA memory coalescing
Occupancy measures how many warps are resident on an SM relative to the hardware maximum. Higher occupancy can help hide latency, but it is not automatically better.
Register use and shared memory use per block limit how many warps fit on an SM, while memory bandwidth, instruction mix, data locality and Tensor Core utilization can dominate actual performance. Sometimes a kernel with lower occupancy but more data reuse wins. Profiling should guide tuning.
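A rough sketch of how per-block resources cap occupancy. The SM limits used as defaults here (64 resident warps, 65536 registers, 96 KiB shared memory, 32 blocks) are illustrative assumptions, not the limits of a specific GPU, and the model ignores allocation granularity; `theoretical_occupancy` is a hypothetical helper.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps=64, regs_per_sm=65536,
                          smem_per_sm=96 * 1024, max_blocks=32, warp_size=32):
    """Estimate resident warps / max warps for one SM (simplified model)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # Each limit gives a maximum number of blocks that fit on the SM.
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks

    blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return blocks * warps_per_block / max_warps


light = theoretical_occupancy(256, regs_per_thread=32, smem_per_block=0)
register_heavy = theoretical_occupancy(256, regs_per_thread=128, smem_per_block=0)
smem_heavy = theoretical_occupancy(256, regs_per_thread=32, smem_per_block=48 * 1024)

print(light, register_heavy, smem_heavy)
```

With these assumed limits, raising registers per thread from 32 to 128 drops the estimate from full occupancy to a quarter. Whether that matters is still a profiling question: the register-heavy kernel may be faster if it reuses more data.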
Optimizing occupancy as a standalone metric. It is a signal, not the objective.
Source / Inspiration: CUDA occupancy · NVIDIA Nsight Systems
A CUDA stream is an ordered queue of GPU work: operations in the same stream execute in issue order. Separate streams can allow overlapping independent kernels or memory transfers when dependencies and hardware permit.
Streams are useful for pipelining data transfers, preprocessing and compute, but they require correct synchronization. Frameworks manage many stream details, yet custom extensions and serving systems can still hit stream-related bugs.
Assuming different streams always run concurrently. Dependencies and hardware resources decide actual overlap.
Source / Inspiration: CUDA streams · GPU quiz inspiration
Mixed precision uses lower-precision formats for much of the computation to reduce memory traffic and use specialized hardware. It can improve throughput and memory footprint when numerics are handled carefully.
Training often keeps some state or reductions in higher precision to preserve stability. Inference may use FP16, BF16, INT8 or lower-bit formats depending on quality and hardware support. The right answer includes validation, not only speed.
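Why reductions are often kept in higher precision can be shown with a toy accumulation. This sketch emulates FP16 and FP32 rounding on the CPU with Python's `struct` module. It is an illustration of the numerics only, not how a framework implements mixed precision, and the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_accumulator):
    """Sum FP16 inputs, rounding the running total after every add."""
    total = 0.0
    for v in values:
        total = round_accumulator(total + to_fp16(v))
    return total


values = [0.001] * 10_000  # true sum is about 10

fp16_sum = accumulate(values, to_fp16)   # FP16 inputs, FP16 accumulator
mixed_sum = accumulate(values, to_fp32)  # FP16 inputs, FP32 accumulator

print(fp16_sum, mixed_sum)
```

The FP16 accumulator stops growing once the running total is large enough that each small addend rounds away, while the FP32 accumulator stays close to 10. This is the kind of behavior that validation should catch before a lower-precision path is shipped.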
Saying lower precision is always safe. Some models, layers or calibration settings can lose quality or stability.
Source / Inspiration: GPU quiz inspiration · CUDA guide
Tensor Cores are specialized GPU units for matrix multiply-accumulate operations in supported precisions. Deep learning frameworks use them through libraries and kernels when shapes and dtypes are compatible.
They are one reason mixed precision can be much faster on supported GPUs. But using Tensor Cores depends on hardware generation, dtype, layout and kernel choice, so check current vendor docs rather than memorizing a support matrix.
Assuming every matrix multiply uses Tensor Cores. Unsupported shapes or dtypes may use different execution paths.
Source / Inspiration: CUDA guide · GPU quiz inspiration
Separate model weights, activations, optimizer states, temporary buffers, fragmentation and serving caches. Then inspect batch size, sequence length, dtype, gradient checkpointing, sharding and allocator behavior.
Training and serving OOMs have different shapes. In training, activations and optimizer state are often large. In LLM serving, KV cache and variable requests can dominate. Profilers and memory summaries are more reliable than guesses.
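A back-of-the-envelope sketch for two of these categories. The byte counts assume a common mixed-precision Adam setup (FP16 weights and gradients, FP32 master weights and two FP32 moments) and an FP16 KV cache; the model dimensions in the example are hypothetical. Activations, temporary buffers and fragmentation are deliberately left out, so treat the result as a lower bound to compare against profiler output.

```python
def training_state_bytes(n_params, weight_bytes=2, grad_bytes=2,
                         master_bytes=4, adam_moment_bytes=4):
    """Weights + gradients + FP32 master copy + two Adam moments.

    Assumes a mixed-precision Adam setup; excludes activations and buffers.
    """
    per_param = weight_bytes + grad_bytes + master_bytes + 2 * adam_moment_bytes
    return n_params * per_param


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Keys and values (factor 2) cached for every layer, head and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes


GIB = 2 ** 30

# Hypothetical 7B-parameter model.
train_bytes = training_state_bytes(7_000_000_000)

# Hypothetical serving shape: 32 layers, 32 KV heads of size 128,
# 4096-token context, batch of 8 concurrent sequences.
kv_bytes = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                          seq_len=4096, batch=8)

print(f"training state: {train_bytes / GIB:.1f} GiB")
print(f"KV cache:       {kv_bytes / GIB:.1f} GiB")
```

The two formulas scale with different knobs: training state grows with parameter count, while the KV cache grows with sequence length and concurrent requests, which is why the same model can OOM for different reasons in training and serving.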
Only reducing batch size without understanding which memory category grew.
Source / Inspiration: GPU quiz inspiration · Hugging Face KV cache docs
Use a timeline profiler to see CPU gaps, GPU kernels, memory copies and collectives. Then use kernel-level metrics when a specific kernel dominates.
Nsight Systems is useful for timeline and system-level bottlenecks; kernel-level tools such as Nsight Compute help inspect occupancy, memory throughput and instruction behavior. The practical goal is to decide whether the bottleneck is input, compute, memory, communication or scheduling.
Looking only at average GPU utilization. A workload can show high utilization while still spending time in inefficient kernels.
Source / Inspiration: NVIDIA Nsight Systems · GPU quiz inspiration
FlashAttention computes attention in tiles and avoids materializing the full attention matrix in high-bandwidth memory. It preserves exact attention while changing the IO pattern.
The key insight is IO awareness: moving data to and from GPU memory can dominate runtime. By streaming blocks through faster on-chip memory and keeping numerically stable online softmax statistics, the algorithm reduces memory traffic and often improves speed.
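The online softmax idea can be sketched in plain Python for a single query vector. This is a CPU illustration of the tiling and rescaling logic only, not the FlashAttention kernel: there is no shared memory, no batching, and the function names are hypothetical.

```python
import math


def scaled_scores(q, keys):
    """Dot-product scores scaled by 1/sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Materializes every score before normalizing (the full-row approach)."""
    scores = scaled_scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Processes keys/values tile by tile, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = scaled_scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first tile exp(-inf) is 0.0, which leaves the zeros untouched.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
        [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.0, 2.0], [-3.0, 1.0]]

print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

Both functions return the same result up to floating-point rounding, which is the sense in which the tiled version is exact: only the order of memory access and the intermediate storage change.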
Saying FlashAttention is an approximate attention method. It is exact attention with a different implementation strategy.
Source / Inspiration: FlashAttention paper · PyTorch SDPA docs
The container image includes user-space libraries, but the host driver and the container runtime expose the GPU. Mismatched driver/runtime expectations or missing device plugin configuration can break workloads.
In Kubernetes, the NVIDIA device plugin advertises GPU resources to pods. In local Docker, the NVIDIA container runtime exposes devices and libraries. Always separate image dependencies from node-level driver setup.
Assuming installing a CUDA toolkit inside the image is enough to access the GPU.
Source / Inspiration: NVIDIA Kubernetes device plugin · GPU quiz inspiration
MIG partitions supported NVIDIA GPUs into isolated GPU instances. It is useful when workloads need smaller, predictable slices instead of an entire GPU.
MIG can improve utilization for many small inference or notebook workloads, but it reduces flexibility and is hardware-specific. Scheduling must understand the exposed resources, and large jobs may prefer full GPUs.
Treating MIG as dynamic time-sharing. It creates partitioned instances with specific resource profiles.
Source / Inspiration: GPU quiz inspiration · NVIDIA Kubernetes device plugin
Compare achieved memory bandwidth, compute utilization and timeline behavior. If performance improves with better data reuse or fewer memory transactions, it is likely memory-bound; if math pipelines are saturated, it is compute-bound.
A roofline-style mental model is useful: arithmetic intensity (FLOPs per byte moved) determines whether memory movement or compute throughput limits performance. Many attention and reduction kernels are sensitive to memory layout and IO, while large GEMMs may be compute-heavy on Tensor Cores.
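The roofline reasoning can be written down in a few lines. The peak numbers below (100 TFLOP/s and 1 TB/s) are made-up round figures for a hypothetical GPU, and the byte counts assume each operand is read or written exactly once, which real kernels only approximate.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (arithmetic intensity, attainable FLOP/s, regime)."""
    intensity = flops / bytes_moved          # FLOPs per byte
    ridge = peak_flops / peak_bandwidth      # intensity where the roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, attainable, regime


PEAK_FLOPS = 100e12      # hypothetical: 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # hypothetical: 1 TB/s

# Elementwise add of two FP32 tensors with n elements:
# 1 FLOP per element, 2 reads + 1 write of 4 bytes each.
n = 1 << 24
add = roofline(flops=n, bytes_moved=12 * n,
               peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH)

# Square FP16 GEMM of size m: 2*m^3 FLOPs, three m*m matrices of 2 bytes.
m = 4096
gemm = roofline(flops=2 * m ** 3, bytes_moved=3 * m * m * 2,
                peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH)

print("add :", add)
print("gemm:", gemm)
```

Under these assumptions the elementwise add sits far below the ridge point and the large GEMM far above it. Changing the shape (for example a very skinny GEMM) moves the same operation across the boundary, so the classification has to be redone per shape and dtype.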
Guessing from operation name alone. The same operation can become memory- or compute-bound depending on shape, dtype and implementation.
Source / Inspiration: NVIDIA Nsight Systems · CUDA guide
Before an interview, you should be able to answer these without reading the page.
Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.