Overview

GPU Systems Interview Practice

Practice explaining the systems covered on this page.

Part of InfraLens Interview Practice

Use questions to train explanation, not memorization

This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.

Reading method

Try to answer each question out loud first. Then open the answer and check whether you covered the mechanism, why it matters, the tradeoffs, the common mistakes and the related handbook or lab.

Map

Question Map

Grouped by the kind of explanation the interview usually asks for.

Q&A

Q&A Cards

Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.

01 Why are GPUs useful for ML workloads?

Short Answer

GPUs run many simple operations in parallel and have high memory bandwidth, which fits dense tensor operations. They are strongest when work is batched and expressed as regular kernels.

Deeper Explanation

A CPU is optimized for low-latency general-purpose control flow; a GPU is optimized for throughput. ML training and inference spend much of their time in matrix multiplies, convolutions and elementwise kernels, which map well to GPU execution.

Common Mistake

Saying GPUs are always faster. Small batches, branching-heavy code or data stalls can leave the GPU underused.

Source / Inspiration: GPU module quiz inspiration · CUDA C Programming Guide

02 What is CUDA in the AI infra stack?

Short Answer

CUDA is NVIDIA's programming and runtime platform for GPU computing. Frameworks such as PyTorch use CUDA libraries and kernels under the hood to execute tensor operations.

Deeper Explanation

Interview answers should distinguish the CUDA programming APIs, the CUDA runtime, libraries such as cuBLAS and cuDNN, and the host driver. Most ML engineers do not write every kernel, but they must understand enough to debug memory, compatibility and performance issues.

Common Mistake

Treating CUDA as only a Python package. It spans driver/runtime/libraries and hardware execution.

Source / Inspiration: GPU module quiz inspiration · CUDA guide

03 How do you explain CUDA memory hierarchy?

Short Answer

Global memory is large but slow relative to on-chip memory; shared memory is smaller and shared by the threads of a block; registers are fastest and private to each thread. Caches sit between these layers and affect observed performance.

Deeper Explanation

The hierarchy matters because many kernels are memory-bandwidth bound. Good kernels improve locality, reuse data in shared memory or registers and avoid unnecessary global reads/writes. The exact layout depends on hardware, so check current NVIDIA docs for details.

Common Mistake

Only saying GPU memory is VRAM. Kernel performance depends on where data is accessed inside the hierarchy.

Source / Inspiration: CUDA memory hierarchy · GPU quiz inspiration

04 What is memory coalescing?

Short Answer

Coalescing means adjacent threads access adjacent memory so the hardware can combine requests efficiently. It improves bandwidth use for global memory loads and stores.

Deeper Explanation

In tensor code, layout and stride decide whether neighboring threads touch neighboring addresses. Non-contiguous views, transposes and irregular indexing can destroy coalescing. This is why shape and stride awareness matters even in high-level frameworks.
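
As a rough illustration of the stride effect described above, the sketch below counts how many memory segments one 32-thread warp touches. The 128-byte segment size, the aligned base address and the helper name `segments_touched` are simplifying assumptions for this model, not a description of any specific GPU; real transaction granularity varies by hardware generation.

```python
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128   # assumed transaction granularity for this toy model


def segments_touched(stride_elems, elem_bytes=4, base_addr=0):
    """Count distinct segments hit when thread t reads element t * stride_elems."""
    segments = set()
    for tid in range(WARP_SIZE):
        addr = base_addr + tid * stride_elems * elem_bytes
        segments.add(addr // SEGMENT_BYTES)
    return len(segments)


# Neighboring threads read neighboring 4-byte floats: one segment serves the warp.
contiguous = segments_touched(stride_elems=1)
# Each thread jumps 32 floats, e.g. walking down a column of a row-major
# matrix that is 32 floats wide: every thread lands in its own segment.
strided = segments_touched(stride_elems=32)
print(contiguous, strided)
```

In this model the same 32 loads cost one segment in the contiguous case and 32 in the strided case, which is the kind of gap a non-contiguous view or transpose can introduce.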

Common Mistake

Thinking coalescing changes mathematical results. It changes memory transaction efficiency.

Source / Inspiration: CUDA memory coalescing

05 What is shared memory, and why can it be faster?

Short Answer

Shared memory is on-chip memory visible to threads in the same block. It is useful for reusing data that would otherwise be repeatedly loaded from global memory.

Deeper Explanation

Shared memory is explicitly managed by the kernel. It can accelerate tiled matrix operations, reductions and stencil-like access patterns, but capacity is limited and synchronization is needed. Poor use can reduce occupancy or create bank conflicts.
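
To make the reuse argument concrete, here is a minimal CPU-side sketch that counts simulated global-memory loads for a naive and a tiled matrix multiply. Python lists stand in for shared memory, and the function names and the load-counting convention are assumptions made for illustration; this is not how a real CUDA kernel is written.

```python
def naive_matmul(a, b):
    """Every multiply-add reads both operands from simulated global memory."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                loads += 2
    return c, loads


def tiled_matmul(a, b, tile):
    """Stage tile x tile blocks once, then reuse them for tile**3 multiply-adds."""
    n = len(a)
    assert n % tile == 0, "this sketch assumes the tile size divides n"
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Copy one block of a and one of b into "shared memory".
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                # A real kernel would synchronize the block before reading the tiles.
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            c[i0 + i][j0 + j] += a_tile[i][k] * b_tile[k][j]
    return c, loads


n = 4
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float(i == j) + 1.0 for j in range(n)] for i in range(n)]
c_naive, loads_naive = naive_matmul(a, b)
c_tiled, loads_tiled = tiled_matmul(a, b, tile=2)
print(loads_naive, loads_tiled)
```

Both versions produce the same result, but the tiled version issues fewer simulated global loads by a factor equal to the tile size, which is the reuse that shared memory makes possible.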

Common Mistake

Assuming shared memory is automatically used for all tensors. Kernels must be written to use it.

Source / Inspiration: CUDA shared memory

06 What is a shared memory bank conflict?

Short Answer

A bank conflict happens when threads in a warp access shared memory addresses that map to the same bank in a conflicting pattern. The hardware serializes those accesses, reducing throughput.

Deeper Explanation

The fix is often to change layout, add padding or alter access pattern. In interviews, connect bank conflicts to why low-level kernels care about indexing details that are invisible in Python code.
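
The padding fix can be sketched with a small model of bank mapping. The model assumes 32 banks of 4-byte words with consecutive words in consecutive banks, which is a common configuration, and `conflict_degree` is a hypothetical helper; check the documentation for your GPU before relying on these numbers.

```python
NUM_BANKS = 32   # assumed: 32 banks, one 4-byte word per bank per cycle
WARP_SIZE = 32


def conflict_degree(row_width, column):
    """Worst-case count of distinct words in one bank when thread t reads tile[t][column].

    Threads that read the same word are not counted as a conflict,
    because that case is served by a broadcast.
    """
    words_per_bank = {}
    for tid in range(WARP_SIZE):
        word_index = tid * row_width + column
        bank = word_index % NUM_BANKS
        words_per_bank.setdefault(bank, set()).add(word_index)
    return max(len(words) for words in words_per_bank.values())


# Column read from a 32-wide tile: every thread maps to the same bank.
unpadded = conflict_degree(row_width=32, column=0)
# One extra word of padding per row shifts each row into a different bank.
padded = conflict_degree(row_width=33, column=0)
print(unpadded, padded)
```

Under these assumptions the unpadded column read serializes 32 ways, while the padded layout is conflict-free even though the Python-level indexing looks the same.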

Common Mistake

Confusing bank conflicts with race conditions. Bank conflicts hurt performance; races hurt correctness.

Source / Inspiration: CUDA shared memory banks · GPU quiz inspiration

07 What is occupancy, and why is it not the only goal?

Short Answer

Occupancy measures how many warps can be resident on an SM relative to the hardware maximum. Higher occupancy can help hide latency, but it is not automatically better; register pressure, shared memory usage, memory bandwidth, instruction mix and data locality can dominate performance.

Deeper Explanation

Register and shared memory use per block cap how many warps can be resident on an SM, while memory bandwidth, instruction mix and Tensor Core utilization decide how well those warps use the hardware. Sometimes a kernel with lower occupancy but more data reuse wins. Profiling should guide tuning.
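
A simplified occupancy estimate shows how registers and shared memory cap resident warps. The per-SM limits in `SmLimits` are hypothetical values chosen for illustration, and the model ignores allocation granularity, so treat it as a sketch rather than a replacement for vendor occupancy tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    # Hypothetical per-SM limits; look up the real values for your GPU.
    max_warps: int = 64
    registers: int = 65536
    shared_mem_bytes: int = 98304
    max_blocks: int = 32


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SmLimits()):
    """Resident warps divided by the SM maximum, under a simplified model."""
    warps_per_block = -(-threads_per_block // 32)  # ceiling division
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * 32)
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    blocks = min(by_warps, by_regs, by_smem, sm.max_blocks)
    return blocks * warps_per_block / sm.max_warps


light = estimate_occupancy(threads_per_block=256, regs_per_thread=32, smem_per_block=0)
reg_heavy = estimate_occupancy(threads_per_block=256, regs_per_thread=64, smem_per_block=0)
smem_heavy = estimate_occupancy(threads_per_block=256, regs_per_thread=32, smem_per_block=48 * 1024)
print(light, reg_heavy, smem_heavy)
```

The register-heavy and shared-memory-heavy variants report lower occupancy, yet either could still be the faster kernel if the extra registers or tiles buy enough data reuse.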

Common Mistake

Optimizing occupancy as a standalone metric. It is a signal, not the objective.

Source / Inspiration: CUDA occupancy · NVIDIA Nsight Systems

08 What are CUDA streams used for?

Short Answer

A CUDA stream is an ordered queue of GPU work: operations in the same stream run in issue order. Separate streams can allow overlapping independent kernels or memory transfers when dependencies and hardware permit.

Deeper Explanation

Streams are useful for pipelining data transfers, preprocessing and compute, but they require correct synchronization. Frameworks manage many stream details, yet custom extensions and serving systems can still hit stream-related bugs.
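
The benefit of pipelining can be estimated with a small timing model. It assumes one copy engine, one compute queue, enough buffers that copies never wait for compute, and fixed per-chunk costs; the function names and the millisecond figures are hypothetical.

```python
def serial_time(chunks, copy_ms, compute_ms):
    """Single stream: each chunk is copied, then computed, with no overlap."""
    return chunks * (copy_ms + compute_ms)


def pipelined_time(chunks, copy_ms, compute_ms):
    """Two streams: the copy of chunk i+1 overlaps the compute of chunk i."""
    copy_done = 0.0
    compute_done = 0.0
    for _ in range(chunks):
        copy_done += copy_ms                   # copies run back to back
        start = max(copy_done, compute_done)   # wait for this chunk's data and the previous kernel
        compute_done = start + compute_ms
    return compute_done


serial = serial_time(chunks=4, copy_ms=2.0, compute_ms=3.0)
overlapped = pipelined_time(chunks=4, copy_ms=2.0, compute_ms=3.0)
print(serial, overlapped)
```

The `max` in the loop is the synchronization point: a kernel cannot start before its input has arrived, which is the dependency a real implementation would have to express with events or stream waits.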

Common Mistake

Assuming different streams always run concurrently. Dependencies and hardware resources decide actual overlap.

Source / Inspiration: CUDA streams · GPU quiz inspiration

09 Why is mixed precision important?

Short Answer

Mixed precision uses lower-precision formats for much of the computation to reduce memory traffic and use specialized hardware. It can improve throughput and memory footprint when numerics are handled carefully.

Deeper Explanation

Training often keeps some state or reductions in higher precision to preserve stability. Inference may use FP16, BF16, INT8 or lower formats depending on quality and hardware support. The right answer includes validation, not only speed.
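
The reason reductions are often kept in higher precision can be shown with the standard library alone. The sketch rounds values to IEEE half precision with the `struct` module's `e` format; Python's double-precision float stands in for a wider accumulator, and the `accumulate` helper is hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(start, increment, steps, low_precision):
    """Add a small FP16 increment repeatedly, optionally storing the sum in FP16."""
    total = to_fp16(start)
    inc = to_fp16(increment)
    for _ in range(steps):
        total = total + inc
        if low_precision:
            total = to_fp16(total)  # accumulator is rounded to FP16 after every add
    return total


fp16_sum = accumulate(1.0, 1e-4, 1000, low_precision=True)
wide_sum = accumulate(1.0, 1e-4, 1000, low_precision=False)
print(fp16_sum, wide_sum)
```

Near 1.0 the gap between adjacent FP16 values is larger than the increment, so the FP16 accumulator never moves, while the wider accumulator ends close to 1.1. Small gradient updates can vanish in the same way.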

Common Mistake

Saying lower precision is always safe. Some models, layers or calibration settings can lose quality or stability.

Source / Inspiration: GPU quiz inspiration · CUDA guide

10 What are Tensor Cores?

Short Answer

Tensor Cores are specialized GPU units for matrix multiply-accumulate operations in supported precisions. Deep learning frameworks use them through libraries and kernels when shapes and dtypes are compatible.

Deeper Explanation

They are one reason mixed precision can be much faster on supported GPUs. But using Tensor Cores depends on hardware generation, dtype, layout and kernel choice, so check current vendor docs rather than memorizing a support matrix.

Common Mistake

Assuming every matrix multiply uses Tensor Cores. Unsupported shapes or dtypes may use different execution paths.

Source / Inspiration: CUDA guide · GPU quiz inspiration

11 How do you debug GPU out-of-memory?

Short Answer

First attribute memory to categories: model weights, activations, optimizer states, temporary buffers, fragmentation and serving caches. Then inspect batch size, sequence length, dtype, gradient checkpointing, sharding and allocator behavior.

Deeper Explanation

Training and serving OOMs have different shapes. In training, activations and optimizer state are often large. In LLM serving, KV cache and variable requests can dominate. Profilers and memory summaries are more reliable than guesses.
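
A back-of-the-envelope estimate helps decide which category to inspect first. The sketch below assumes FP32 weights and gradients with an Adam-style optimizer holding two FP32 moments per parameter, and an FP16 KV cache; the model dimensions are hypothetical, and activations, temporary buffers and fragmentation are deliberately left out.

```python
GIB = 1024 ** 3


def training_state_bytes(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Weights + gradients + optimizer state per parameter (activations excluded)."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes)


def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, elem_bytes=2):
    """Keys and values (the factor of 2) cached for every layer, head and token."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * elem_bytes


# Hypothetical 1B-parameter training job.
train_bytes = training_state_bytes(num_params=1_000_000_000)

# Hypothetical serving configuration: 8 concurrent 4096-token sequences.
kv_bytes = kv_cache_bytes(batch=8, seq_len=4096, layers=32, kv_heads=32, head_dim=128)

print(f"training state: {train_bytes / GIB:.1f} GiB")
print(f"KV cache:       {kv_bytes / GIB:.1f} GiB")
```

Estimates like these are a starting point for forming a hypothesis; an allocator memory summary or profiler trace should confirm which category actually grew.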

Common Mistake

Only reducing batch size without understanding which memory category grew.

Source / Inspiration: GPU quiz inspiration · Hugging Face KV cache docs

12 How would you profile a GPU workload?

Short Answer

Use a timeline profiler to see CPU gaps, GPU kernels, memory copies and collectives. Then use kernel-level metrics when a specific kernel dominates.

Deeper Explanation

Nsight Systems is useful for timeline and system-level bottlenecks; kernel-level tools such as Nsight Compute help inspect occupancy, memory throughput and instruction behavior. The practical goal is to decide whether the bottleneck is input, compute, memory, communication or scheduling.

Common Mistake

Looking only at average GPU utilization. A workload can show high utilization while still spending time in inefficient kernels.

Source / Inspiration: NVIDIA Nsight Systems · GPU quiz inspiration

13 Why does FlashAttention reduce attention memory pressure?

Short Answer

FlashAttention computes attention in tiles and avoids materializing the full attention matrix in high-bandwidth memory (HBM), the GPU's main device memory. It preserves exact attention while changing the IO pattern.

Deeper Explanation

The key insight is IO awareness: moving data to and from GPU memory can dominate. By streaming blocks through faster memory and keeping numerically stable online softmax statistics, the algorithm reduces memory traffic and often improves speed.
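
The online softmax idea can be shown for a single query in plain Python. This is a minimal sketch with scalar values and a hypothetical `tiled_attention` helper, not the FlashAttention kernel itself; it only demonstrates that processing scores block by block with running statistics reproduces the full-softmax result.

```python
import math


def naive_attention(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention(scores, values, block):
    """Process one block at a time, keeping only a running max, sum and accumulator."""
    running_max = -math.inf
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new reference maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0]
reference = naive_attention(scores, values)
tiled = tiled_attention(scores, values, block=2)
print(reference, tiled)
```

The tiled version never holds more than one block of scores at a time, which is why the full attention matrix does not need to be written out, and the two results agree up to floating-point rounding.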

Common Mistake

Saying FlashAttention is an approximate attention method. It is exact attention with a different implementation strategy.

Source / Inspiration: FlashAttention paper · PyTorch SDPA docs

14 What compatibility issues appear with CUDA in containers?

Short Answer

The container image carries user-space CUDA libraries, while the host's NVIDIA driver and container runtime expose the GPU to the container. A host driver that is too old for the image's CUDA version, or missing device plugin configuration, can break workloads.

Deeper Explanation

In Kubernetes, the NVIDIA device plugin advertises GPU resources to pods. In local Docker, the NVIDIA container runtime exposes devices and libraries. Always separate image dependencies from node-level driver setup.

Common Mistake

Assuming installing a CUDA toolkit inside the image is enough to access the GPU.

Source / Inspiration: NVIDIA Kubernetes device plugin · GPU quiz inspiration

15 What is MIG, and when is it useful?

Short Answer

MIG partitions supported NVIDIA GPUs into isolated GPU instances. It is useful when workloads need smaller, predictable slices instead of an entire GPU.

Deeper Explanation

MIG can improve utilization for many small inference or notebook workloads, but it reduces flexibility and is hardware-specific. Scheduling must understand the exposed resources, and large jobs may prefer full GPUs.

Common Mistake

Treating MIG as dynamic time-sharing. It creates partitioned instances with specific resource profiles.

Source / Inspiration: GPU quiz inspiration · NVIDIA Kubernetes device plugin

16 How do you decide whether a kernel is memory-bound or compute-bound?

Short Answer

Compare achieved memory bandwidth, compute utilization and timeline behavior. If performance improves with better data reuse or fewer memory transactions, it is likely memory-bound; if math pipelines are saturated, it is compute-bound.

Deeper Explanation

A roofline-style mental model is useful: arithmetic intensity, the number of operations performed per byte moved, determines whether memory movement or compute throughput limits performance. Many attention and reduction kernels are sensitive to memory layout and IO, while large GEMMs may be compute-heavy on Tensor Cores.
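
The roofline reasoning reduces to a few lines of arithmetic. The peak numbers below describe a hypothetical device, and the intensity formulas assume ideal reuse with no cache effects, so the output is a first guess to be checked against a profiler.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute peak or by bandwidth * FLOPs-per-byte."""
    return min(peak_flops, peak_bandwidth * intensity)


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare arithmetic intensity with the ridge point of the roofline."""
    ridge = peak_flops / peak_bandwidth
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device: 100 TFLOP/s peak math, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# FP32 elementwise add: 1 FLOP per 12 bytes (two 4-byte reads, one 4-byte write).
add_intensity = 1 / 12

# Square FP16 GEMM, n = 4096: 2*n^3 FLOPs over three n*n matrices of 2-byte elements.
n = 4096
gemm_intensity = (2 * n ** 3) / (3 * n * n * 2)

add_kind = classify(add_intensity, PEAK_FLOPS, PEAK_BW)
gemm_kind = classify(gemm_intensity, PEAK_FLOPS, PEAK_BW)
print(add_kind, gemm_kind)
```

On this hypothetical device the elementwise add sits far below the ridge point and the large GEMM far above it, but a small or oddly shaped GEMM can fall on the memory-bound side, which is why shape matters more than the operation name.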

Common Mistake

Guessing from operation name alone. The same operation can become memory- or compute-bound depending on shape, dtype and implementation.

Source / Inspiration: NVIDIA Nsight Systems · CUDA guide

Review

Final Review Checklist

Before an interview, you should be able to answer these without reading the page.

  • Why are GPUs useful for ML workloads?
  • What is CUDA in the AI infra stack?
  • How do you explain CUDA memory hierarchy?
  • What is memory coalescing?
  • What is shared memory, and why can it be faster?
  • What is a shared memory bank conflict?
  • What is occupancy, and why is it not the only goal?
  • What are CUDA streams used for?

Sources

Sources and Further Reading

Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.
