Coding and Estimation Interview Practice
Write the tensor or resource ledger before reaching for a framework API or optimization name.
Q&A Cards
01 How do you estimate per-rank training state under sharding?
Short Answer
Account separately for parameter bytes, gradient bytes, and optimizer-state bytes. Divide by the data-parallel degree only the state that the chosen ZeRO stage or FSDP strategy actually shards, and state explicitly that activations and temporary buffers are excluded unless you model them separately.
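The ledger above can be sketched as a small function. This is a minimal sketch, not a framework API: the function name is hypothetical, and the default byte counts are assumptions (bf16 parameters and gradients at 2 bytes each, Adam with fp32 master weights, momentum, and variance at 12 bytes per parameter). Activations and temporary buffers are deliberately left out.

```python
def per_rank_state_bytes(n_params, dp_degree, zero_stage,
                         param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate per-rank parameter, gradient, and optimizer-state bytes.

    zero_stage: 0 = nothing sharded, 1 = optimizer state sharded,
                2 = + gradients sharded, 3 = + parameters sharded.
    Byte-per-parameter defaults are assumptions; adjust to your setup.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be >= 1")

    params = n_params * param_bytes
    grads = n_params * grad_bytes
    optim = n_params * optim_bytes

    # Divide only the state that the chosen stage actually shards.
    if zero_stage >= 1:
        optim = optim / dp_degree
    if zero_stage >= 2:
        grads = grads / dp_degree
    if zero_stage >= 3:
        params = params / dp_degree

    return {"params": params, "grads": grads, "optim": optim,
            "total": params + grads + optim}


if __name__ == "__main__":
    for stage in range(4):
        est = per_rank_state_bytes(1_000_000_000, dp_degree=8, zero_stage=stage)
        print(stage, f"{est['total'] / 1e9:.2f} GB")
```

Under these assumptions, a 1B-parameter model on 8 data-parallel ranks goes from 16 GB of state per rank with no sharding to 2 GB with all three categories sharded.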
02 What assumptions belong in a communication estimate?
Short Answer
Name the payload bytes, collective type, world size, effective bandwidth, topology and contending-traffic assumptions, and overlap assumption. A bytes-over-bandwidth result is a lower bound on transfer time, not a measured step time.
Source: NCCL collectives
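A bytes-over-bandwidth estimate can be written down in a few lines. The sketch below assumes a ring algorithm, where each rank moves 2(N-1)/N of the payload for all-reduce and (N-1)/N for all-gather or reduce-scatter; the function name is hypothetical, `payload_bytes` is taken to be the full tensor size, and latency, contention, and overlap are ignored.

```python
def collective_lower_bound_s(payload_bytes, world_size, bw_bytes_per_s,
                             collective="all_reduce"):
    """Bandwidth-only lower bound in seconds for one collective.

    Assumes a ring algorithm and a single effective per-rank bandwidth.
    payload_bytes is the full (unsharded) tensor size.
    """
    ring_factor = {"all_reduce": 2.0, "all_gather": 1.0, "reduce_scatter": 1.0}
    if collective not in ring_factor:
        raise ValueError(f"unsupported collective: {collective}")
    if world_size < 1 or bw_bytes_per_s <= 0:
        raise ValueError("world_size must be >= 1 and bandwidth positive")

    # Bytes each rank must send (and receive) over the ring.
    traffic = ring_factor[collective] * (world_size - 1) / world_size * payload_bytes
    return traffic / bw_bytes_per_s


if __name__ == "__main__":
    # Example: 8 GB of gradients, 8 ranks, 100 GB/s effective bandwidth (assumed).
    t = collective_lower_bound_s(8e9, 8, 100e9, "all_reduce")
    print(f"all-reduce lower bound: {t * 1e3:.1f} ms")
```

The returned number is only as good as the effective-bandwidth input, which is why the answer should name where that figure comes from.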
03 Why is an estimated speculative speedup insufficient?
Short Answer
Acceptance rate depends on workload and draft quality, so an estimate made on one workload does not transfer to another. A deployment measurement must include draft overhead, target verification, sampling correctness, batching divergence, and KV rollback behavior.
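To see what an estimate leaves out, it helps to write the estimate itself. The sketch below uses a common idealized model, assuming every drafted token is accepted independently with the same probability; the function names and the `draft_cost_ratio` parameter (draft step cost relative to one target pass) are hypothetical. It models none of the batching, rollback, or sampling effects listed above.

```python
def expected_tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per target verification pass.

    Idealized: each of the draft_len drafted tokens is accepted
    independently with probability accept_rate, and the target pass
    always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def idealized_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Tokens per unit cost relative to plain decoding (one token per target pass)."""
    tokens = expected_tokens_per_target_pass(accept_rate, draft_len)
    cost = draft_len * draft_cost_ratio + 1.0  # draft steps plus one target pass
    return tokens / cost


if __name__ == "__main__":
    for a in (0.4, 0.6, 0.8):
        print(a, round(idealized_speedup(a, draft_len=4, draft_cost_ratio=0.1), 2))
```

Note that the same formula predicts a slowdown when acceptance is low, which is one reason the number has to be measured on the real workload.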
04 How should a system-design coding answer end?
Short Answer
End with observability and an invalidation test: which metrics verify the claimed bottleneck, which failure path is recoverable, and what result would make you change the design.
05 How would you live-code a minimal decoder block?
Short Answer
Start from x: [B,S,D]. Apply pre-norm attention with Q/K/V split into [B,H,S,Dh], mask future positions before softmax, merge heads back to [B,S,D], and add the result to the residual stream. Then add a pre-norm SwiGLU feed-forward residual. Validate with shape assertions, finite outputs, and a causal no-leak test.
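The steps above can be sketched in NumPy. This is a minimal forward-only sketch under stated assumptions: RMSNorm without a learned gain, no positional encoding, no dropout, and hypothetical parameter names (`wq`, `w_gate`, and so on) with random weights.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain (assumption, for brevity).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(D, F, rng, std=0.02):
    shapes = {"wq": (D, D), "wk": (D, D), "wv": (D, D), "wo": (D, D),
              "w_gate": (D, F), "w_up": (D, F), "w_down": (F, D)}
    return {name: rng.normal(0.0, std, shape) for name, shape in shapes.items()}


def decoder_block(x, p, H):
    B, S, D = x.shape
    assert D % H == 0, "model width must divide evenly across heads"
    Dh = D // H

    def split_heads(t):  # [B,S,D] -> [B,H,S,Dh]
        return t.reshape(B, S, H, Dh).transpose(0, 2, 1, 3)

    # Pre-norm causal self-attention.
    h = rms_norm(x)
    q, k, v = (split_heads(h @ p[n]) for n in ("wq", "wk", "wv"))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(Dh)        # [B,H,S,S]
    future = np.triu(np.ones((S, S), dtype=bool), k=1)        # True above diagonal
    scores = np.where(future, -np.inf, scores)                # mask before softmax
    attn = softmax(scores) @ v                                # [B,H,S,Dh]
    merged = attn.transpose(0, 2, 1, 3).reshape(B, S, D)      # merge heads
    x = x + merged @ p["wo"]                                  # attention residual

    # Pre-norm SwiGLU feed-forward.
    h = rms_norm(x)
    gate = h @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))
    x = x + (silu * (h @ p["w_up"])) @ p["w_down"]            # FFN residual

    assert x.shape == (B, S, D)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 5, 16))
    p = init_params(16, 32, rng)
    y = decoder_block(x, p, H=4)
    assert np.isfinite(y).all()
    # Causal no-leak test: changing the last token must not alter earlier outputs.
    x2 = x.copy()
    x2[:, -1] += 1.0
    y2 = decoder_block(x2, p, H=4)
    assert np.allclose(y[:, :-1], y2[:, :-1])
    print("shape, finiteness, and no-leak checks passed")
```

The no-leak test is the one worth narrating out loud: it catches an off-by-one or transposed mask that shape assertions alone would miss.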
06 What changes between MHA, MQA, GQA, and MLA in a whiteboard answer?
Short Answer
MHA stores K/V per query head, MQA shares one K/V head across all query heads, GQA shares K/V by groups, and MLA stores a compressed latent representation. Connect that shape change to KV-cache bytes, decode bandwidth, and implementation compatibility.
| Variant | Cached representation | Systems tradeoff |
|---|---|---|
| MHA | K/V for every query head. | Largest KV cache; full head-specific representation. |
| MQA | One shared K/V head. | Smallest KV cache; strongest sharing constraint. |
| GQA | K/V shared within query-head groups. | Middle ground for cache bytes and capacity. |
| MLA | Compressed latent cache state. | Lower residency with architecture-specific decode support. |
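The cache-size differences in the table reduce to one multiplication. The sketch below is illustrative only: the function names are hypothetical, the example dimensions (32 layers, 32 query heads, head width 128, 2-byte elements) are assumptions rather than any specific model, and `latent_dim` stands for whatever total width an MLA-style design caches per layer.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes cached per token when K and V are stored per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # 2 = K and V


def latent_cache_bytes_per_token(n_layers, latent_dim, dtype_bytes=2):
    """Bytes cached per token when one compressed latent is stored per layer."""
    return n_layers * latent_dim * dtype_bytes


if __name__ == "__main__":
    layers, q_heads, head_dim = 32, 32, 128  # assumed example dimensions
    variants = {
        "MHA": kv_cache_bytes_per_token(layers, q_heads, head_dim),
        "GQA (8 KV heads)": kv_cache_bytes_per_token(layers, 8, head_dim),
        "MQA": kv_cache_bytes_per_token(layers, 1, head_dim),
        "MLA (latent 512)": latent_cache_bytes_per_token(layers, 512),
    }
    for name, nbytes in variants.items():
        print(f"{name:18s} {nbytes / 1024:7.0f} KiB per token")
```

Multiplying the per-token figure by batch size and context length gives the resident cache, which is also roughly the data read per decode step.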
07 How do SFT, DPO, PPO / KL, and GRPO differ in code?
Short Answer
SFT uses shifted next-token labels and cross entropy. DPO compares chosen-versus-rejected log-probability margins against a frozen reference. PPO-style training uses sampled rewards with policy-control terms such as KL, while GRPO derives relative advantages within groups of sampled responses. Name the data contract before writing the loss.
| Objective | Data contract | Code-level distinction |
|---|---|---|
| SFT | Prompt and target-token sequence. | Shifted labels with cross-entropy loss. |
| DPO | Chosen/rejected response pair and reference model. | Optimizes a preference margin against reference log-probabilities. |
| PPO / KL | Sampled rollout, reward, and reference control. | Policy update with reward/advantage and KL constraint. |
| GRPO | Group of sampled responses for one prompt. | Computes relative advantages within the group. |
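The code-level distinctions in the table can be shown on scalar inputs. This is a minimal sketch with hypothetical function names: log-probabilities are assumed to be precomputed, the PPO term shown is the common clipped-ratio variant with KL control handled elsewhere, and the GRPO helper covers only the group-relative advantage, not the full objective.

```python
import math


def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over target positions.

    token_logprobs[i] is the log-probability of the (already shifted) label i.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    return sum(picked) / len(picked)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(margin)) on summed sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ppo_clip_term(ratio, advantage, clip=0.2):
    """Clipped surrogate for one sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(sft_loss([-1.0, -2.0, -3.0], [0, 1, 1]))
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
    print(ppo_clip_term(1.5, 1.0))
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Each function's argument list is its data contract: SFT needs labels and a mask, DPO needs four log-probabilities per pair, PPO needs a ratio and an advantage, and GRPO needs a whole group of rewards.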
08 How do ordinary decoding strategies differ from speculative decoding?
Short Answer
Greedy, temperature, top-k, top-p, and beam search define how logits become selected tokens. Speculative decoding is an execution optimization that proposes tokens and verifies them under a target-policy correctness rule. Do not present nucleus sampling as draft verification.
| Technique class | What changes | Correctness or quality implication |
|---|---|---|
| Greedy / sampling / beam | The selection policy applied to target-model logits. | Changes diversity or search behavior. |
| Speculative decoding | Execution path: draft proposals verified by the target model. | Must preserve the target policy while measuring acceptance and overhead. |
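One way to make the "preserve the target policy" row concrete is the standard accept/reject rule for a single drafted token: accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The sketch below assumes both distributions are given as explicit probability lists over the same vocabulary; the function names are hypothetical.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Return (accepted, emitted_token) for one drafted token.

    p_target and q_draft are probability lists over the same vocabulary.
    token must have been sampled from q_draft, so q_draft[token] > 0.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return True, token
    # Rejected: resample from the part of the target the draft under-covers.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    emitted = rng.choices(range(len(residual)), weights=residual)[0]
    return False, emitted


def speculative_sample_one(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    token = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    _, emitted = verify_draft_token(token, p_target, q_draft, rng)
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    p, q = [0.7, 0.2, 0.1], [0.2, 0.5, 0.3]
    n = 20000
    counts = [0, 0, 0]
    for _ in range(n):
        counts[speculative_sample_one(p, q, rng)] += 1
    print([round(c / n, 3) for c in counts], "target:", p)
```

The emitted-token frequencies should track the target distribution even though the draft distribution is quite different, which is the property that separates verification from a selection policy such as top-p.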
09 Why are LoRA and MoE not interchangeable extensions?
Short Answer
LoRA learns a low-rank delta on selected frozen weights, reducing trainable adapter state and supporting a merge-equivalence check. MoE changes forward execution by routing each token to selected expert FFNs, creating capacity, load-balance, and communication tradeoffs.
| Extension | What changes | Primary systems cost |
|---|---|---|
| LoRA | Adds trained low-rank weight deltas. | Adapter residency, switching, or merge policy. |
| MoE | Routes tokens through selected expert FFNs. | Expert capacity, balancing, and all-to-all traffic. |
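The two rows can be contrasted in a few lines of NumPy. This is a sketch with hypothetical names and random weights: the LoRA half demonstrates the merge-equivalence check, and the MoE half shows only top-k routing and the resulting per-expert load, not expert execution or all-to-all communication.

```python
import numpy as np


def lora_forward(x, w, a, b, scale):
    """Frozen weight plus low-rank update. w:[Din,Dout], a:[Din,r], b:[r,Dout]."""
    return x @ w + scale * ((x @ a) @ b)


def merge_lora(w, a, b, scale):
    """Fold the adapter into a single dense weight."""
    return w + scale * (a @ b)


def top_k_route(router_logits, k):
    """Pick k experts per token. Returns ([T,k] indices, per-expert token counts)."""
    n_experts = router_logits.shape[-1]
    idx = np.argsort(-router_logits, axis=-1)[:, :k]
    counts = np.bincount(idx.ravel(), minlength=n_experts)
    return idx, counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_out, r, tokens, experts = 64, 64, 4, 10, 8

    x = rng.normal(size=(tokens, d_in))
    w = rng.normal(size=(d_in, d_out))
    a = rng.normal(size=(d_in, r))
    b = rng.normal(size=(r, d_out))

    # Merge-equivalence check: adapter path and merged weight agree.
    assert np.allclose(lora_forward(x, w, a, b, 0.5), x @ merge_lora(w, a, b, 0.5))
    print("trainable:", r * (d_in + d_out), "of", d_in * d_out, "dense weights")

    # Routing changes which FFN each token runs through; load can be uneven.
    idx, counts = top_k_route(rng.normal(size=(tokens, experts)), k=2)
    print("tokens per expert:", counts.tolist())
```

LoRA leaves the forward graph mergeable into the original shape, while routing produces a per-expert load vector that has to be balanced and communicated, so the two do not trade off against each other.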
