Overview

Transformer Interview Practice

Practice explaining each system out loud, directly on the page.

Part of InfraLens Interview Practice

Use questions to train explanation, not memorization

This page uses the public ai-infra-engineer-learning curriculum as inspiration for question coverage. Answers are rewritten and reorganized for this site's handbook/interview format.

Reading method

Try to answer each question out loud first. Then open the answer and check whether you covered mechanism, why it matters, tradeoffs, common mistakes and the related handbook/lab.

Map

Question Map

Grouped by the kind of explanation the interview usually asks for.

Q&A

Q&A Cards

Each answer is intentionally short enough to rehearse, with deeper notes for follow-up questions.

01. What are Q, K and V in attention?

Short Answer

Q is the query for what a token wants to read, K is the key used to match against queries, and V is the content that gets mixed after attention weights are computed. They are learned projections of token hidden states.

Deeper Explanation

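The attention score is the scaled dot product between a query and each key, QK^T / sqrt(d_k). Softmax turns each query's scores into weights, and the output is the weighted sum of V. This separates addressing from content, which is why shape reasoning matters: queries and keys decide which positions to read, values provide the information that is read.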
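
The addressing-versus-content split can be sketched in a few lines of plain Python. This is a minimal single-head illustration with no batching, masking or learned projections; the function names and the toy values are assumptions made for the example, not code from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head attention over lists of vectors (no mask, no batching)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Addressing: one scaled dot-product score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Content: weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                          # one query, width 2
K = [[10.0, 0.0], [0.0, 10.0]]            # two keys, width 2
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # two values, width 3
out = attention(Q, K, V)                  # close to V[0], because the query matches K[0]
```

Note that the output width follows V (3 here), not Q or K (2), which is the shape-level form of "queries choose positions, values provide information".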

Common Mistake

Saying Q, K and V are three different input sequences in self-attention. In self-attention they usually come from the same hidden states through different projections.

Source / Inspiration: Attention paper · PyTorch SDPA docs

02. Why does decoder-only attention need a causal mask?

Short Answer

The causal mask prevents a position from attending to future tokens. Training can compute all positions in parallel while preserving the autoregressive rule used at inference.

Deeper Explanation

Without the mask, the model would leak target tokens during next-token prediction. The mask sets the scores for future positions to negative infinity before softmax so they receive zero weight; it does not change the parameter count. It is central to explaining why training is parallel but generation is sequential.
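
A small sketch of what the mask does to a score matrix, assuming a square matrix where row i is the query position and column j is the key position. The helper name `causal_softmax` is hypothetical.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i may only see columns 0..i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, which becomes weight 0 after exp.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because column i is always visible
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
weights = causal_softmax(scores)
```

All three rows are computed from one score matrix, which is the parallel-training picture; the large scores at future positions in rows 0 and 1 simply never contribute.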

Common Mistake

Thinking the model is trained one token at a time because generation is sequential.

Source / Inspiration: Attention paper · PyTorch SDPA docs

03. What does multi-head attention add?

Short Answer

Multi-head attention runs multiple attention projections in parallel so different heads can attend to different patterns or subspaces. The heads are concatenated and projected back to the model dimension.

Deeper Explanation

The model dimension is split across heads, so each head has a smaller head dimension. This gives several attention views without multiplying the final hidden width by the number of heads. Production systems care because head layout affects KV cache shape and attention kernels.
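
The split-and-concatenate step can be shown for a single token vector. This sketch assumes the model dimension divides evenly across heads and leaves out the per-head attention and the final output projection; `split_heads` and `merge_heads` are illustrative names.

```python
def split_heads(x, num_heads):
    """Split one token's projected vector of width model_dim into per-head slices."""
    model_dim = len(x)
    assert model_dim % num_heads == 0, "model_dim must divide evenly across heads"
    head_dim = model_dim // num_heads
    return [x[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head outputs back into one vector of width model_dim."""
    return [value for head in heads for value in head]

x = [float(i) for i in range(8)]      # model_dim = 8
heads = split_heads(x, num_heads=4)   # 4 heads, head_dim = 2
merged = merge_heads(heads)           # width is 8 again, not 8 * 4
```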

Common Mistake

Saying each head is a separate model. Heads are parallel projections inside one layer.

Source / Inspiration: Attention paper · Hugging Face Transformers docs

04. How do you reason about attention tensor shapes?

Short Answer

Start from hidden states `(batch, seq, model_dim)`. Q, K and V are projected and reshaped into heads, commonly `(batch, heads, seq, head_dim)`, then attention computes scores over query and key sequence positions.

Deeper Explanation

This shape path explains most implementation bugs. Score tensors scale with query length times key length, while KV cache scales with layers, batch, heads, sequence length and head dimension. Shape reasoning also clarifies why prefill and decode differ: prefill has a query length equal to the prompt length, while each decode step has a query length of one.
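
One way to rehearse the shape path is to write it down as pure arithmetic. The sketch below assumes the number of key/value heads equals the number of query heads (no grouped-query attention) and uses made-up sizes; `attention_shapes` is a hypothetical helper.

```python
def attention_shapes(batch, q_len, kv_len, model_dim, num_heads):
    """Return the tensor shapes along the attention path as plain tuples."""
    head_dim = model_dim // num_heads
    return {
        "hidden": (batch, q_len, model_dim),
        "q": (batch, num_heads, q_len, head_dim),
        "k": (batch, num_heads, kv_len, head_dim),
        "v": (batch, num_heads, kv_len, head_dim),
        "scores": (batch, num_heads, q_len, kv_len),  # query positions x key positions
        "output": (batch, q_len, model_dim),
    }

# Prefill: every prompt position is a query against every prompt position.
prefill = attention_shapes(batch=2, q_len=1024, kv_len=1024, model_dim=4096, num_heads=32)
# Decode: one new query against the cached prefix plus itself.
decode = attention_shapes(batch=2, q_len=1, kv_len=1025, model_dim=4096, num_heads=32)
```

Comparing `prefill["scores"]` with `decode["scores"]` makes the difference between the two phases visible without running a model.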

Common Mistake

Memorizing formulas without tracking which dimension is sequence and which is head dimension.

Source / Inspiration: PyTorch SDPA docs · Hugging Face KV cache docs

05. What problem does RoPE solve?

Short Answer

RoPE injects position information by rotating query and key vectors in a position-dependent way. It lets attention scores depend on relative position patterns without adding a learned position vector to hidden states.

Deeper Explanation

The key interview point is that self-attention without position information is permutation-equivariant: shuffling the input tokens just shuffles the outputs, so word order is invisible to the model. RoPE rotates Q and K before the dot product, so the score between two positions depends on their relative offset. Long-context behavior still depends on implementation and training choices.
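
The relative-offset property can be checked numerically. This is a minimal sketch that rotates consecutive pairs of dimensions, assuming an even vector length and the commonly used base of 10000; some implementations pair dimensions differently (a "rotate half" layout), so treat the pairing here as one possible convention.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)  # assumed even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))       # query at 5, key at 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # both shifted by 100: offset still 3
```

The two scores agree up to floating-point error, which is the sense in which RoPE makes attention depend on relative rather than absolute position.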

Common Mistake

Saying RoPE stores positions in the KV cache as separate embeddings. It changes Q/K representations.

Source / Inspiration: RoPE paper · Hugging Face Transformers docs

06. Why is the FFN or MLP block important?

Short Answer

Attention mixes information across tokens, while the FFN transforms each token independently through learned nonlinear layers. It contributes a large share of Transformer parameters and compute.

Deeper Explanation

Modern LLMs often use gated variants such as SwiGLU. The FFN expands the hidden dimension, applies nonlinearity/gating and projects back. In interviews, say attention is communication and FFN is per-token computation.
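
The expand, gate and project-back pattern looks like this for one token. It is a sketch of a SwiGLU-style FFN without biases; the weight matrices are tiny made-up values and the names `W_gate`, `W_up` and `W_down` are assumptions chosen for readability.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(z):
    # SiLU / swish activation: z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # expand and apply nonlinearity
    up = matvec(W_up, x)                          # expand (linear branch)
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # project back to model_dim

# Toy sizes: model_dim = 2, ffn_dim = 4.
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_up = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
y = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
```

The function takes a single token vector and never looks at other positions, which is the "per-token computation" half of the communication-versus-computation framing.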

Common Mistake

Treating attention as the whole Transformer. The MLP is usually a major part of capacity.

Source / Inspiration: Attention paper · Hugging Face Transformers docs

07. What do residual connections and normalization do?

Short Answer

Residual connections preserve a path for information and gradients across layers. Normalization stabilizes activation scale so deep networks train more predictably.

Deeper Explanation

Many modern decoder models use pre-norm or RMSNorm-like variants, but the interview-level explanation stays the same: stable optimization. Residual plus norm lets layers learn changes to representations rather than rebuilding them from scratch.
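
A pre-norm residual step can be sketched as "normalize, run the sublayer, add the result back". The code below assumes an RMSNorm-style normalization with a learned per-dimension weight; `pre_norm_residual` and the epsilon value are illustrative choices.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def pre_norm_residual(x, sublayer, weight):
    """Pre-norm pattern: the sublayer sees normalized input, the residual adds raw x."""
    update = sublayer(rms_norm(x, weight))
    return [xi + ui for xi, ui in zip(x, update)]

x = [3.0, -4.0, 12.0, 0.0]
weight = [1.0] * len(x)
normed = rms_norm(x, weight)
# A sublayer that outputs zeros leaves x untouched: the layer only learns a change.
y = pre_norm_residual(x, lambda h: [0.0 for _ in h], weight)
```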

Common Mistake

Describing normalization as only preventing overfitting. Its main role here is optimization stability.

Source / Inspiration: Attention paper · Hugging Face docs

08. How is next-token training loss computed?

Short Answer

The model predicts logits for each position, and the labels are the input token ids shifted by one. Cross entropy trains the model to assign high probability to the actual next token.

Deeper Explanation

During training, all positions can be processed in parallel under the causal mask. Padding or ignored positions must be masked in the loss. This is different from inference, where the model repeatedly feeds back generated tokens.
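
The shift and the loss mask fit in one short function. This sketch works on plain lists for a single sequence; the sentinel value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy loss, but the function itself is a hypothetical stand-in, not a library API.

```python
import math

IGNORE_INDEX = -100  # sentinel for padding or otherwise ignored targets

def next_token_loss(logits, input_ids, ignore_index=IGNORE_INDEX):
    """logits[t] scores the vocabulary at position t; its target is input_ids[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(input_ids) - 1):
        target = input_ids[t + 1]          # the shift: predict the NEXT token
        if target == ignore_index:
            continue                       # masked positions do not contribute
        row = logits[t]
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]       # negative log-probability of the target
        count += 1
    return total / count if count else 0.0

input_ids = [2, 0, 1, IGNORE_INDEX]        # last position is padding
logits = [[0.0, 0.0, 0.0]] * 4             # uniform predictions over a 3-token vocab
loss = next_token_loss(logits, input_ids)  # equals log(3) for uniform logits
```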

Common Mistake

Forgetting the shift between inputs and labels. Under a causal mask, position t already sees token t, so training it to predict token t teaches copying rather than next-token prediction.

Source / Inspiration: Hugging Face Transformers docs · Attention paper

09. Why does KV cache make autoregressive inference faster?

Short Answer

KV cache stores previous keys and values so each decode step only computes attention inputs for the new token and reuses the prefix cache. This avoids recomputing all previous K/V projections every step.

Deeper Explanation

The speedup comes with a memory cost. Cache size is roughly 2 (keys and values) × layers × batch × KV heads × sequence length × head dimension × bytes per element. It also shifts serving from a pure compute problem toward memory management and scheduling.
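
A back-of-the-envelope calculator makes the memory cost concrete. The configuration below is hypothetical (32 layers, 32 KV heads, head dimension 128, 2 bytes per element) and the function ignores allocator overhead and paging.

```python
def kv_cache_bytes(layers, batch, kv_heads, seq_len, head_dim, bytes_per_elem=2):
    """Approximate KV cache size; the leading 2 covers keys and values."""
    return 2 * layers * batch * kv_heads * seq_len * head_dim * bytes_per_elem

size = kv_cache_bytes(layers=32, batch=1, kv_heads=32, seq_len=4096, head_dim=128)
gib = size / 2**30   # 2.0 GiB for one 4096-token sequence in this configuration
```

Doubling batch or sequence length doubles this number, which is why cache capacity rather than compute often limits how many requests a server can hold at once.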

Common Mistake

Thinking KV cache stores generated text only. It stores tensor representations used by attention.

Source / Inspiration: Hugging Face KV cache docs · LLM quiz inspiration

10. What is the difference between prefill and decode?

Short Answer

Prefill processes the prompt and builds the initial KV cache. Decode generates new tokens step by step, reusing that cache.

| Phase | State change | User-visible metric | Typical bottleneck |
| --- | --- | --- | --- |
| Prefill | Builds KV cache for prompt tokens. | TTFT | Long-prompt attention compute. |
| Decode | Reads and extends the cache token by token. | TPOT | Cache bandwidth and scheduler fairness. |

Deeper Explanation

Prefill is more like a full sequence pass and can be compute-heavy for long prompts. Decode often has small per-step computation but many repeated steps and cache reads. Serving metrics split these phases because they stress hardware differently.
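
The two phases can be mimicked with a toy single-head cache. This sketch uses the hidden state directly as query, key and value (identity projections, an assumption made to keep it short) and loops over prompt positions where a real kernel would process them in parallel; `ToyKVCache` is a hypothetical class.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[j] for e, v in zip(exps, values)) for j in range(len(values[0]))]

class ToyKVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, prompt_states):
        """Build the cache for the whole prompt; position t sees positions 0..t."""
        outputs = []
        for h in prompt_states:
            self.keys.append(h)
            self.values.append(h)
            outputs.append(attend(h, self.keys, self.values))
        return outputs

    def decode_step(self, new_state):
        """Append one token's K/V and attend with a single query."""
        self.keys.append(new_state)
        self.values.append(new_state)
        return attend(new_state, self.keys, self.values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = ToyKVCache()
cache.prefill(prompt)
cached_out = cache.decode_step(new_token)

# Recomputing over the full sequence gives the same output for the last position.
full = prompt + [new_token]
recomputed = attend(new_token, full, full)
```

The decode step does one query's worth of work but reads every cached key and value, which is the small-compute, heavy-read profile described above.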

Common Mistake

Using one average latency number for both phases. Time to first token (TTFT) and time per output token (TPOT) reveal different bottlenecks.

Source / Inspiration: Hugging Face KV cache docs · vLLM docs

11. How do SDPA and FlashAttention relate?

Short Answer

SDPA is PyTorch's scaled-dot-product attention interface that can dispatch to different kernels. FlashAttention is an IO-aware exact attention algorithm often used as one backend when supported.

| Term | Role | Boundary |
| --- | --- | --- |
| SDPA | Framework API and dispatch point. | Does not guarantee a particular backend. |
| FlashAttention | Exact IO-aware attention algorithm/kernel backend. | Requires supported shape, mask, dtype, and hardware. |

Deeper Explanation

The high-level math is still scaled dot-product attention. The backend changes memory behavior, supported masks/dtypes and performance. In interviews, avoid claiming every SDPA call uses FlashAttention; dispatch depends on conditions and current docs.
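
"Same math, different memory behavior" can be illustrated with a streaming softmax. The sketch below processes keys in blocks while keeping a running maximum and running sum, which is the rescaling idea that tiled exact-attention kernels such as FlashAttention build on. It is plain Python for one query, not the actual kernel, and the block size is arbitrary.

```python
import math

def scaled_dot(q, k):
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def naive_attention_row(q, keys, values):
    """Materialize every score for the query, then softmax and mix values."""
    scores = [scaled_dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(len(values[0]))]

def streaming_attention_row(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scaled_dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0], [0.0, 3.0]]
naive = naive_attention_row(q, keys, values)
streamed = streaming_attention_row(q, keys, values)
```

Both functions return the same vector up to floating-point rounding, which is what "exact" means here: the backend changes what is held in memory, not the result.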

Common Mistake

Equating API name with a specific kernel. The interface and backend are separate.

Source / Inspiration: PyTorch SDPA docs · FlashAttention paper

12. Why does attention memory scale poorly with sequence length?

Short Answer

Naive attention scores scale with query length times key length. Doubling sequence length can roughly quadruple the score matrix size for full self-attention.

Deeper Explanation

This is why long-context models stress memory and why kernels avoid materializing large score matrices when possible. KV cache grows linearly with generated context, but training-time attention activations can have quadratic components.
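
The quadratic term is easy to put numbers on. This sketch counts bytes for one layer's fully materialized score tensor under full self-attention, using hypothetical sizes (32 heads, 2 bytes per element); kernels that avoid materializing the matrix would not pay this cost.

```python
def score_matrix_bytes(batch, heads, q_len, kv_len, bytes_per_elem=2):
    """Bytes needed if one layer's attention scores are stored in full."""
    return batch * heads * q_len * kv_len * bytes_per_elem

small = score_matrix_bytes(batch=1, heads=32, q_len=4096, kv_len=4096)   # 1 GiB
large = score_matrix_bytes(batch=1, heads=32, q_len=8192, kv_len=8192)   # 4 GiB
ratio = large / small   # doubling the sequence length quadruples the score tensor
```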

Common Mistake

Mixing up parameter count with activation memory. Long sequences stress runtime memory even when parameters are unchanged.

Source / Inspiration: Attention paper · FlashAttention paper

13. How does quantization affect Transformer inference?

Short Answer

Quantization reduces memory footprint and bandwidth by representing weights or activations with fewer bits. It can improve serving capacity, but quality and kernel support must be validated.

Deeper Explanation

For LLMs, weight memory, KV cache precision and compute kernels are separate concerns. A model can be weight-quantized but still use higher precision activations or cache. The right format depends on hardware and workload.
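
Weight quantization in its simplest form is a scale plus rounding. The sketch below shows symmetric per-tensor int8 quantization on a handful of made-up weights; production formats typically use per-channel or per-group scales and dedicated kernels, so this only illustrates the storage-versus-error tradeoff.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale, integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for use in higher-precision compute."""
    return [v * scale for v in q]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The integers take one byte each instead of two or four, while `max_error` stays within about half a quantization step. Whether that error is acceptable is a model-quality question that has to be measured.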

Common Mistake

Assuming a smaller checkpoint automatically means faster serving. Runtime kernels and memory bottlenecks decide actual latency.

Source / Inspiration: LLM quiz inspiration · Hugging Face Transformers docs

14. Why is tokenization part of systems behavior?

Short Answer

Tokenization determines sequence length, vocabulary ids and special tokens. It affects memory, latency, truncation behavior and compatibility between model and serving code.

Deeper Explanation

Two prompts with similar character length can produce different token counts. In serving, token count drives prefill cost and cache growth. In training, tokenizer changes alter the dataset representation and model interface.
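
The character-count versus token-count gap can be shown with a toy tokenizer. The greedy longest-match function and the tiny vocabulary below are entirely hypothetical and much simpler than real subword tokenizers; they only illustrate that text covered by the vocabulary compresses into fewer tokens.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match over a toy vocabulary; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

TOY_VOCAB = {"the", " the", " cat", "cat", " sat", "ing", " "}  # hypothetical vocabulary

common = greedy_tokenize("the cat sat", TOY_VOCAB)   # covered by the vocabulary
rare = greedy_tokenize("zqx vwk jyp", TOY_VOCAB)     # same character length, no coverage
```

Both strings have 11 characters, but the second produces several times more tokens, and it is the token count that drives prefill cost and cache growth.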

Common Mistake

Treating tokenizer as a frontend detail. It is part of the model contract.

Source / Inspiration: Hugging Face Transformers docs

15. How would you explain attention backend choice in production?

Short Answer

Use the highest-level correct API first, then verify which backend is selected for your dtype, mask, hardware, installed packages, model implementation and sequence shape. Transformers exposes configurable attention implementations for supported models, but exact option names and defaults are version-sensitive.

Deeper Explanation

Production code should not rely on folklore. Check current framework docs, profile representative traffic and keep fallbacks. Long prompts, causal masks and mixed precision can change which kernel is usable.

Common Mistake

Hard-coding a backend assumption because a blog benchmark looked good on another GPU, or assuming one option list applies to all models and library versions.

Source / Inspiration: PyTorch SDPA docs · FlashAttention paper

16. How would you describe a decoder-only Transformer block end to end?

Short Answer

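A token embedding enters a stack of blocks. In the common pre-norm layout, each block normalizes, applies causal self-attention to read previous context, adds a residual, normalizes again, applies an FFN, and adds another residual. After the last block, a final normalization and output projection produce the logits that predict the next token.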

Deeper Explanation

This answer connects architecture to training and inference. Causal mask enforces autoregression, RoPE adds position information, FFN adds capacity, and KV cache makes decode efficient. It is the core oral explanation for Transformer interviews.

Common Mistake

Listing components without explaining how they work together.

Source / Inspiration: Attention paper · Hugging Face docs

Review

Final Review Checklist

Before an interview, you should be able to answer these without reading the page.

  • What are Q, K and V in attention?
  • Why does decoder-only attention need a causal mask?
  • What does multi-head attention add?
  • How do you reason about attention tensor shapes?
  • What problem does RoPE solve?
  • Why is the FFN or MLP block important?
  • What do residual connections and normalization do?
  • How is next-token training loss computed?
Sources

Sources and Further Reading

Official docs and papers are used for factual grounding; community/curriculum material is used for coverage and intuition.
