Foundation
4 representative questions.
Practice the system explanation inside the page.
This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.
Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.
Grouped by the kind of explanation the interview usually asks for.
Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.
Q is the query for what a token wants to read, K is the key used to match against queries, and V is the content that gets mixed after attention weights are computed. They are learned projections of token hidden states.
The attention score is the scaled dot product of Q and K; softmax then turns the scores into weights over V. This separates addressing from content: queries and keys choose positions, values provide information. It is also why shape reasoning matters, since scores live on a query-by-key grid while the output keeps the shape of V.
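The Q/K/V description above can be sketched for a single head in NumPy. This is a minimal illustration, not a production kernel: the sizes and the random projection weights `W_q`, `W_k`, `W_v` are assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, model_dim, head_dim = 4, 8, 8   # hypothetical sizes

x = rng.normal(size=(seq, model_dim))            # hidden states, one row per token
# Hypothetical projection weights; in a real model these are learned.
W_q = rng.normal(size=(model_dim, head_dim))
W_k = rng.normal(size=(model_dim, head_dim))
W_v = rng.normal(size=(model_dim, head_dim))

Q, K, V = x @ W_q, x @ W_k, x @ W_v              # all three come from the same x

scores = Q @ K.T / np.sqrt(head_dim)             # (seq, seq): query i against key j
scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability only
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
out = weights @ V                                # (seq, head_dim): weighted mix of values

print(weights.shape, out.shape)
```

Each row of `weights` sums to 1, so every output row is a convex combination of value rows.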
Saying Q, K and V are three different input sequences in self-attention. In self-attention they usually come from the same hidden states through different projections.
Source / Inspiration: Attention paper · PyTorch SDPA docs
The causal mask prevents a position from attending to future tokens. Training can compute all positions in parallel while preserving the autoregressive rule used at inference.
Without the mask, the model would leak target tokens during next-token prediction. The mask changes which scores are visible before softmax, not the parameter count. It is central to explaining why training is parallel but generation is sequential.
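A small NumPy sketch of the masking step described above. The helper name `causal_softmax` is hypothetical; it only shows that future scores are hidden before softmax while all rows are still computed in one pass.

```python
import numpy as np

def causal_softmax(scores):
    """Softmax over keys where position i may only see positions j <= i."""
    seq = scores.shape[-1]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)              # hide future scores
    masked = masked - masked.max(axis=-1, keepdims=True)    # diagonal is always visible
    weights = np.exp(masked)                                # exp(-inf) becomes 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_softmax(rng.normal(size=(5, 5)))  # all 5 positions in one call
print(np.round(weights, 2))
```

The first row can only attend to itself, and no learned parameter is added or removed by the mask.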
Thinking the model is trained one token at a time because generation is sequential.
Source / Inspiration: Attention paper · PyTorch SDPA docs
Multi-head attention runs multiple attention projections in parallel so different heads can attend to different patterns or subspaces. The heads are concatenated and projected back to the model dimension.
The model dimension is split across heads, so each head has a smaller head dimension. This gives several attention views without multiplying the final hidden width by the number of heads. Production systems care because head layout affects KV cache shape and attention kernels.
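The split-and-concatenate step can be sketched as a pair of reshapes. The function names `split_heads` and `merge_heads` and the sizes are illustrative assumptions; real implementations differ in layout details.

```python
import numpy as np

batch, seq, model_dim, num_heads = 2, 5, 16, 4   # hypothetical sizes
head_dim = model_dim // num_heads                # each head sees a 4-wide slice

def split_heads(x, num_heads):
    """(batch, seq, model_dim) -> (batch, heads, seq, head_dim)."""
    b, s, d = x.shape
    return x.reshape(b, s, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, heads, seq, head_dim) -> (batch, seq, model_dim)."""
    b, h, s, hd = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, s, h * hd)

x = np.arange(batch * seq * model_dim, dtype=np.float64).reshape(batch, seq, model_dim)
per_head = split_heads(x, num_heads)
restored = merge_heads(per_head)
print(per_head.shape, restored.shape)
```

Merging restores the original hidden width, which is why adding heads does not multiply the model dimension.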
Saying each head is a separate model. Heads are parallel projections inside one layer.
Source / Inspiration: Attention paper · Hugging Face Transformers docs
Start from hidden states `(batch, seq, model_dim)`. Q, K and V are projected and reshaped into heads, commonly `(batch, heads, seq, head_dim)`, then attention computes scores over query and key sequence positions.
This shape path explains most implementation bugs. Score tensors scale with query length times key length, while KV cache scales with layers, batch, heads, sequence length and head dimension. Shape reasoning also clarifies why prefill and decode differ: prefill scores a full query-by-key grid, while a decode step scores a single query row against all cached keys.
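To make the shape path concrete, here is a hedged NumPy sketch with made-up sizes. It compares the score tensor for a full pass with the score tensor for one decode step.

```python
import numpy as np

batch, heads, q_len, k_len, head_dim = 2, 4, 6, 6, 8   # hypothetical sizes
rng = np.random.default_rng(0)
q = rng.normal(size=(batch, heads, q_len, head_dim))
k = rng.normal(size=(batch, heads, k_len, head_dim))

# Full pass (prefill-like): every query position against every key position.
scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(head_dim)

# One decode step: only the newest query, against all keys seen so far.
q_step = q[:, :, -1:, :]
step_scores = np.einsum("bhqd,bhkd->bhqk", q_step, k) / np.sqrt(head_dim)

print(scores.shape, step_scores.shape)
```

The last two axes of `scores` are query and key positions; `head_dim` has been contracted away by the dot product.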
Memorizing formulas without tracking which dimension is sequence and which is head dimension.
Source / Inspiration: PyTorch SDPA docs · Hugging Face KV cache docs
RoPE injects position information by rotating query and key vectors in a position-dependent way. It lets attention scores depend on relative position patterns without adding a learned position vector to hidden states.
The key interview point is that attention has no built-in notion of token order unless position is encoded: without a mask or position signal, permuting the input tokens simply permutes the outputs (permutation equivariance). RoPE modifies Q and K before the dot product, which makes relative offsets visible to attention. Long-context behavior still depends on implementation and training choices.
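A minimal RoPE sketch, assuming the interleaved-pair convention and the commonly used base of 10000. Real libraries may pair dimensions differently, so treat the function `rope` as illustrative. It checks the property described above: the Q·K score depends on the offset between positions, not on the absolute positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # (dim/2,) rotation speeds
    angles = positions[:, None] * inv_freq[None, :]     # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin           # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(q_pos, k_pos):
    """Dot product of the rotated query and key at the given positions."""
    rq = rope(q, np.array([q_pos]))[0]
    rk = rope(k, np.array([k_pos]))[0]
    return float(rq @ rk)

# Same offset (2), different absolute positions: scores agree up to rounding.
print(score(3, 1), score(10, 8))
```

Nothing is added to the hidden states here; only the Q and K vectors are rotated.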
Saying RoPE stores positions in the KV cache as separate embeddings. It changes Q/K representations.
Source / Inspiration: RoPE paper · Hugging Face Transformers docs
Attention mixes information across tokens, while the FFN transforms each token independently through learned nonlinear layers. The FFN contributes a large share of Transformer parameters and compute.
Modern LLMs often use gated variants such as SwiGLU. The FFN expands the hidden dimension, applies nonlinearity/gating and projects back. In interviews, say attention is communication and FFN is per-token computation.
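The expand, gate and project-back pattern can be sketched as follows. This is an assumption-laden toy: the weight names `W_gate`, `W_up`, `W_down`, the sizes and the random initialization are all hypothetical. The second call shows the "per-token computation" point: changing one token leaves the other tokens' outputs untouched.

```python
import numpy as np

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """x: (seq, model_dim). Expand, gate, project back; no mixing across tokens."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
seq, model_dim, hidden_dim = 4, 8, 32            # hypothetical sizes
x = rng.normal(size=(seq, model_dim))
W_gate = rng.normal(size=(model_dim, hidden_dim)) * 0.1
W_up = rng.normal(size=(model_dim, hidden_dim)) * 0.1
W_down = rng.normal(size=(hidden_dim, model_dim)) * 0.1

y = swiglu_ffn(x, W_gate, W_up, W_down)

# Perturb only the last token and recompute.
x2 = x.copy()
x2[3] += 1.0
y2 = swiglu_ffn(x2, W_gate, W_up, W_down)
print(np.allclose(y[:3], y2[:3]))   # earlier tokens are unaffected
```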
Treating attention as the whole Transformer. The MLP is usually a major part of capacity.
Source / Inspiration: Attention paper · Hugging Face Transformers docs
Residual connections preserve a path for information and gradients across layers. Normalization stabilizes activation scale so deep networks train more predictably.
Many modern decoder models use pre-norm or RMSNorm-like variants, but the interview-level explanation is stable optimization. Residual plus norm lets layers learn changes to representations rather than rebuilding them from scratch.
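A hedged sketch of the pre-norm residual pattern with an RMSNorm-style normalization. The helper names and the `eps` value are assumptions; the point is only that the sublayer adds a change on top of `x` rather than replacing it.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale each token vector to roughly unit root-mean-square, then apply a gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def pre_norm_residual(x, sublayer, weight):
    """Pre-norm block step: normalize, run the sublayer, add the result to x."""
    return x + sublayer(rms_norm(x, weight))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 50.0        # deliberately large activations
weight = np.ones(8)

normed = rms_norm(x, weight)
print(np.sqrt((normed ** 2).mean(axis=-1)))   # close to 1 for every token

# A sublayer that outputs zeros leaves x unchanged: the residual path is an identity.
same = pre_norm_residual(x, lambda h: np.zeros_like(h), weight)
```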
Describing normalization as only preventing overfitting. Its main role here is optimization stability.
Source / Inspiration: Attention paper · Hugging Face docs
The model predicts logits for each position, and the labels are the input token ids shifted by one. Cross entropy trains the model to assign high probability to the actual next token.
During training, all positions can be processed in parallel under the causal mask. Padding or ignored positions must be masked in the loss. This is different from inference, where the model repeatedly feeds back generated tokens.
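The shift and the loss masking can be sketched in NumPy. The function `next_token_loss` is hypothetical, and the `-100` marker is an assumption chosen to mirror the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import numpy as np

IGNORE = -100  # assumed ignore marker, mirroring PyTorch's default ignore_index

def next_token_loss(logits, input_ids, pad_id):
    """logits: (seq, vocab); input_ids: (seq,). Mean loss over non-padding targets."""
    pred = logits[:-1]                        # predictions made at positions 0..seq-2
    labels = input_ids[1:].copy()             # targets are the tokens one step ahead
    labels[labels == pad_id] = IGNORE         # padding must not contribute to the loss
    keep = labels != IGNORE
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(labels))[keep], labels[keep]]
    return -picked.mean()

vocab = 10
input_ids = np.array([5, 2, 7, 0, 0])         # last two ids are padding (pad_id=0)
logits = np.zeros((len(input_ids), vocab))    # uniform predictions

loss = next_token_loss(logits, input_ids, pad_id=0)
print(loss, np.log(vocab))                    # uniform logits give log(vocab)
```

Only the two real targets (`2` and `7`) are scored; all positions are handled in a single vectorized pass.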
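Forgetting the shift between inputs and labels. Without the shift, the model is trained to reproduce the token it already sees at that position, which is trivial and teaches nothing about next-token prediction.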
Source / Inspiration: Hugging Face Transformers docs · Attention paper
KV cache stores previous keys and values so each decode step computes Q, K and V only for the new token and reuses the cached prefix. This avoids recomputing all previous K/V projections every step.
The speedup comes with a memory cost. Cache size grows with layers, batch, heads, sequence length and head dimension, for both keys and values. It also shifts the serving problem from pure compute toward memory management and scheduling.
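The growth factors above translate into a back-of-envelope formula. The function name and the model configuration below are hypothetical, and `bytes_per_elem=2` assumes an fp16/bf16 cache; `kv_heads` equals the attention head count unless the model shares K/V across heads.

```python
def kv_cache_bytes(layers, batch, kv_heads, seq_len, head_dim, bytes_per_elem=2):
    """Keys plus values for every layer; bytes_per_elem=2 assumes fp16/bf16."""
    return 2 * layers * batch * kv_heads * seq_len * head_dim * bytes_per_elem

# Hypothetical configuration, not a specific published model.
size = kv_cache_bytes(layers=32, batch=1, kv_heads=32, seq_len=4096, head_dim=128)
print(size / 2**30, "GiB")   # 2.0 GiB

# Doubling the sequence length or the batch doubles the cache.
longer = kv_cache_bytes(layers=32, batch=1, kv_heads=32, seq_len=8192, head_dim=128)
```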
Thinking KV cache stores generated text only. It stores tensor representations used by attention.
Source / Inspiration: Hugging Face KV cache docs · LLM quiz inspiration
Prefill processes the prompt and builds the initial KV cache. Decode generates new tokens step by step, reusing that cache.
| Phase | State change | User-visible metric | Typical bottleneck |
|---|---|---|---|
| Prefill | Builds KV cache for prompt tokens. | TTFT | Long-prompt attention compute. |
| Decode | Reads and extends the cache token by token. | TPOT | Cache bandwidth and scheduler fairness. |
Prefill is more like a full sequence pass and can be compute-heavy for long prompts. Decode often has small per-step computation but many repeated steps and cache reads. Serving metrics split these phases, with time to first token (TTFT) tracking prefill and time per output token (TPOT) tracking decode, because they stress hardware differently.
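A toy single-layer, single-head sketch of the two phases, assuming the hidden states are given. In a real multi-layer model each layer keeps its own cache and prefill also produces attention outputs for the prompt; all names and sizes here are illustrative. The check at the end shows that decoding with the cache matches recomputing from scratch.

```python
import numpy as np

def attend(q, K, V):
    """q: (n_q, d); K, V: (n_k, d). No mask: the caller passes only visible keys."""
    s = q @ K.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # hypothetical weights
prompt = rng.normal(size=(5, d))      # hidden states of 5 prompt tokens
new_tok = rng.normal(size=(1, d))     # hidden state of one generated token

# Prefill: project the whole prompt once and keep K/V as the cache.
K_cache, V_cache = prompt @ W_k, prompt @ W_v

# Decode: project only the new token, append to the cache, attend over it.
K_cache = np.vstack([K_cache, new_tok @ W_k])
V_cache = np.vstack([V_cache, new_tok @ W_v])
out_cached = attend(new_tok @ W_q, K_cache, V_cache)

# Reference: recompute K/V for the full sequence from scratch.
full = np.vstack([prompt, new_tok])
out_full = attend(full[-1:] @ W_q, full @ W_k, full @ W_v)
print(np.allclose(out_cached, out_full))
```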
Using one average latency number for both phases. TTFT and token rate reveal different bottlenecks.
Source / Inspiration: Hugging Face KV cache docs · vLLM docs
SDPA is PyTorch's scaled-dot-product attention interface that can dispatch to different kernels. FlashAttention is an IO-aware exact attention algorithm often used as one backend when supported.
| Term | Role | Boundary |
|---|---|---|
| SDPA | Framework API and dispatch point. | Does not guarantee a particular backend. |
| FlashAttention | Exact IO-aware attention algorithm/kernel backend. | Requires supported shape, mask, dtype, and hardware. |
The high-level math is still scaled dot-product attention. The backend changes memory behavior, supported masks/dtypes and performance. In interviews, avoid claiming every SDPA call uses FlashAttention; dispatch depends on conditions and current docs.
Equating API name with a specific kernel. The interface and backend are separate.
Source / Inspiration: PyTorch SDPA docs · FlashAttention paper
Naive attention scores scale with query length times key length. Doubling sequence length can roughly quadruple the score matrix size for full self-attention.
This is why long-context models stress memory and why kernels avoid materializing large score matrices when possible. KV cache grows linearly with generated context, but training-time attention activations can have quadratic components.
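One way to see how a kernel can avoid materializing the full score matrix is the online-softmax trick, sketched below in NumPy. This is an illustration of the idea under simplifying assumptions (no mask, one head, hypothetical block size), not the FlashAttention kernel itself. The result matches naive attention up to floating-point rounding.

```python
import numpy as np

def naive_attention(q, K, V):
    s = q @ K.T / np.sqrt(q.shape[-1])            # materializes all (n_q, n_k) scores
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def blockwise_attention(q, K, V, block=4):
    """Same result, but only an (n_q, block) slice of scores exists at a time."""
    n_q, d = q.shape
    running_max = np.full((n_q, 1), -np.inf)
    running_sum = np.zeros((n_q, 1))
    acc = np.zeros((n_q, V.shape[-1]))
    for start in range(0, K.shape[0], block):
        s = q @ K[start:start + block].T / np.sqrt(d)
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)   # shrink earlier partial sums
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        acc = acc * rescale + p @ V[start:start + block]
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 8))
print(np.allclose(naive_attention(q, K, V), blockwise_attention(q, K, V)))
```

The naive version holds `n_q * n_k` scores at once, so doubling both lengths quadruples that buffer; the blockwise version holds `n_q * block`.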
Mixing up parameter count with activation memory. Long sequences stress runtime memory even when parameters are unchanged.
Source / Inspiration: Attention paper · FlashAttention paper
Quantization reduces memory footprint and bandwidth by representing weights or activations with fewer bits. It can improve serving capacity, but quality and kernel support must be validated.
For LLMs, weight memory, KV cache precision and compute kernels are separate concerns. A model can be weight-quantized but still use higher precision activations or cache. The right format depends on hardware and workload.
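A minimal weight-only quantization sketch, assuming symmetric per-tensor int8. Production schemes are usually per-channel or per-group and rely on dedicated kernels, so the function names and the scheme here are illustrative only. It shows the memory saving and the bounded rounding error, and that dequantized values are back in higher precision.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover a float32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes, q.nbytes)                 # int8 storage is a quarter of float32
max_err = np.abs(w - w_hat).max()         # bounded by about half a quantization step
```

Smaller storage does not by itself imply lower latency; that depends on whether the runtime has kernels that consume the quantized format directly.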
Assuming a smaller checkpoint automatically means faster serving. Runtime kernels and memory bottlenecks decide actual latency.
Source / Inspiration: LLM quiz inspiration · Hugging Face Transformers docs
Tokenization determines sequence length, vocabulary ids and special tokens. It affects memory, latency, truncation behavior and compatibility between model and serving code.
Two prompts with similar character length can produce different token counts. In serving, token count drives prefill cost and cache growth. In training, tokenizer changes alter the dataset representation and model interface.
Treating tokenizer as a frontend detail. It is part of the model contract.
Source / Inspiration: Hugging Face Transformers docs
Use the highest-level correct API first, then verify which backend is selected for your dtype, mask, hardware, installed packages, model implementation and sequence shape. Transformers exposes configurable attention implementations for supported models, but exact option names and defaults are version-sensitive.
Production code should not rely on folklore. Check current framework docs, profile representative traffic and keep fallbacks. Long prompts, causal masks and mixed precision can change which kernel is usable.
Hard-coding a backend assumption because a blog benchmark looked good on another GPU, or assuming one option list applies to all models and library versions.
Source / Inspiration: PyTorch SDPA docs · FlashAttention paper
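A token embedding enters a stack of blocks. In a typical pre-norm decoder, each block normalizes, applies causal self-attention to read previous context, adds a residual, normalizes again, applies an FFN and adds another residual. After the last block, an output projection turns the final hidden state into logits that predict the next token.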
This answer connects architecture to training and inference. Causal mask enforces autoregression, RoPE adds position information, FFN adds capacity, and KV cache makes decode efficient. It is the core oral explanation for Transformer interviews.
Listing components without explaining how they work together.
Source / Inspiration: Attention paper · Hugging Face docs
Before an interview, you should be able to answer these without reading the page.
Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.