Coding Practice Interview

01 How do you estimate per-rank training state under sharding?

Short Answer

Account separately for parameter bytes, gradient bytes, and optimizer-state bytes. Divide by the data-parallel degree only the state that the chosen ZeRO stage or FSDP mode actually shards, and say explicitly that activations and temporary buffers are excluded unless they are modeled separately.
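The accounting above can be sketched in a few lines. This is a minimal estimate, not a memory profiler: the byte counts per parameter (2 for bf16 parameters, 2 for bf16 gradients, 12 for fp32 master weights plus two Adam moments) are assumptions for mixed-precision Adam, and the function name `per_rank_state` and its `zero_stage` argument are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StateBytes:
    params: float
    grads: float
    optimizer: float

    @property
    def total(self) -> float:
        return self.params + self.grads + self.optimizer


def per_rank_state(n_params, dp_degree, zero_stage,
                   param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate persistent training state per rank, in bytes.

    zero_stage: 0 = nothing sharded, 1 = optimizer state sharded,
    2 = plus gradients, 3 = plus parameters (FSDP full-shard style).
    Activations and temporary buffers are deliberately NOT included.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    params = n_params * param_bytes
    grads = n_params * grad_bytes
    optim = n_params * optim_bytes
    # Divide only the components that the selected stage shards.
    if zero_stage >= 1:
        optim /= dp_degree
    if zero_stage >= 2:
        grads /= dp_degree
    if zero_stage >= 3:
        params /= dp_degree
    return StateBytes(params, grads, optim)


# Hypothetical 1B-parameter model on 8 data-parallel ranks.
for stage in range(4):
    s = per_rank_state(1_000_000_000, dp_degree=8, zero_stage=stage)
    print(f"stage {stage}: {s.total / 1e9:.2f} GB per rank")
```

Keeping the three components as separate fields makes it easy to state out loud which term dominated and which one the sharding choice actually reduced.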

02 What assumptions belong in a communication estimate?

Short Answer

Name the payload bytes, collective type, world size, effective bandwidth, topology and contending-traffic assumptions, and the assumed overlap with compute. A bytes-over-bandwidth result is a lower bound, not a measured step time.

Source: NCCL collectives
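A bytes-over-bandwidth estimate for this card might look like the sketch below. It assumes ring-style algorithms, where each rank moves 2(n-1)/n of the payload for all-reduce and (n-1)/n for all-gather or reduce-scatter, and it assumes a single effective per-rank bandwidth with no contention. The function names and the `overlap_fraction` parameter are hypothetical.

```python
def collective_lower_bound_s(payload_bytes, world_size,
                             bandwidth_bytes_per_s, collective="all_reduce"):
    """Lower bound on collective time under a ring-algorithm assumption.

    payload_bytes is the full tensor size (for all_gather, the gathered size).
    Ignores latency, topology, and contending traffic, so real time is higher.
    """
    n = world_size
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
    }
    if collective not in factors:
        raise ValueError(f"unsupported collective: {collective}")
    return factors[collective] * payload_bytes / bandwidth_bytes_per_s


def exposed_time_s(comm_time_s, overlap_fraction):
    """Communication time left on the critical path after assumed overlap."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    return comm_time_s * (1.0 - overlap_fraction)


# Hypothetical: 1 GB of gradients, 8 ranks, 100 GB/s effective bandwidth.
t = collective_lower_bound_s(1e9, 8, 100e9, "all_reduce")
print(f"all-reduce lower bound: {t * 1e3:.1f} ms")
print(f"exposed at 60% overlap: {exposed_time_s(t, 0.6) * 1e3:.1f} ms")
```

Every argument is one of the named assumptions, which is the point: changing any of them changes the answer, and none of them is a measurement.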

03 Why is an estimated speculative speedup insufficient?

Short Answer

Acceptance rate depends on workload and draft quality, so an estimate does not transfer between deployments. A deployment measurement must include draft overhead, target verification, sampling correctness, batching divergence, and KV rollback behavior.
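To see what such an estimate assumes, here is a sketch of the standard closed-form model. It treats each draft token as accepted independently with a fixed rate `alpha`, proposes `gamma` tokens per step, and prices one draft forward pass at `cost_ratio` times a target pass. All three are idealized inputs, and the function names are hypothetical. None of the deployment effects listed above appear in it.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target verification pass.

    Assumes i.i.d. acceptance with rate alpha and gamma drafted tokens:
    (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Idealized speedup over plain decoding.

    One step costs gamma draft passes plus one target pass. Ignores batching
    divergence, KV rollback, and scheduling overhead.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# The same configuration swings from a win to a loss as acceptance changes.
for alpha in (0.9, 0.6, 0.3):
    print(alpha, round(estimated_speedup(alpha, gamma=4, cost_ratio=0.1), 2))
```

Because the result is this sensitive to `alpha`, and `alpha` itself varies by prompt mix, the number is a hypothesis to test rather than a claim to report.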

04 How should a system-design coding answer end?

Short Answer

End with observability and an invalidation test: which metrics verify the claimed bottleneck, which failure path is recoverable, and what result would make you change the design.

05 How would you live-code a minimal decoder block?

Short Answer

Start from x: [B,S,D]. Apply pre-norm attention with Q/K/V split into [B,H,S,Dh], mask future positions before softmax, merge heads back to [B,S,D], then add a pre-norm SwiGLU feed-forward residual. Validate with shape assertions, finite outputs, and a causal no-leak test.
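The recipe above can be sketched without any tensor library. The version below uses nested Python lists so it runs anywhere; a live answer would normally use a tensor framework instead. Several choices are assumptions made for brevity: RMSNorm without a learned gain, no positional encoding, no biases, and a causal mask implemented by scoring only keys at positions up to `t` (equivalent to setting future scores to negative infinity before softmax). The helper names and `init_params` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """[n,k] @ [k,m] -> [n,m] on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def rmsnorm(x, eps=1e-6):
    """Scale each row of [S,D] by the inverse of its root mean square."""
    out = []
    for row in x:
        scale = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * scale for v in row])
    return out


def causal_attention(x, wq, wk, wv, wo, n_heads):
    """x: [S,D] -> [S,D]. Position t attends only to positions <= t."""
    S, D = len(x), len(x[0])
    assert D % n_heads == 0, "D must split evenly into heads"
    dh = D // n_heads
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    merged = [[0.0] * D for _ in range(S)]  # heads written back into [S,D]
    for h in range(n_heads):
        lo, hi = h * dh, (h + 1) * dh  # this head's slice of the model dim
        for t in range(S):
            # Causal mask: future keys never enter the softmax.
            scores = [sum(q[t][i] * k[s][i] for i in range(lo, hi)) / math.sqrt(dh)
                      for s in range(t + 1)]
            m = max(scores)
            w = [math.exp(sc - m) for sc in scores]
            z = sum(w)
            for i in range(lo, hi):
                merged[t][i] = sum(w[s] * v[s][i] for s in range(t + 1)) / z
    return matmul(merged, wo)


def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (silu(x W_gate) * (x W_up)) W_down."""
    gate, up = matmul(x, w_gate), matmul(x, w_up)
    hidden = [[(g / (1.0 + math.exp(-g))) * u for g, u in zip(gr, ur)]
              for gr, ur in zip(gate, up)]
    return matmul(hidden, w_down)


def decoder_block(x, p, n_heads):
    """x: [B,S,D]. Pre-norm attention residual, then pre-norm SwiGLU residual."""
    out = []
    for seq in x:
        h = add(seq, causal_attention(rmsnorm(seq), p["wq"], p["wk"],
                                      p["wv"], p["wo"], n_heads))
        out.append(add(h, swiglu(rmsnorm(h), p["w_gate"], p["w_up"], p["w_down"])))
    return out


def init_params(d, d_ff, seed=0):
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.gauss(0.0, 0.2) for _ in range(c)] for _ in range(r)]
    return {"wq": mat(d, d), "wk": mat(d, d), "wv": mat(d, d), "wo": mat(d, d),
            "w_gate": mat(d, d_ff), "w_up": mat(d, d_ff), "w_down": mat(d_ff, d)}
```

The three validations named above map directly onto this code: assert the output is `[B,S,D]`, assert every value is finite, and perturb the last position of one sequence to confirm that outputs at earlier positions do not change.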

06 What changes between MHA, MQA, GQA, and MLA in a whiteboard answer?

Short Answer

MHA stores K/V per query head, MQA shares one K/V head across all query heads, GQA shares K/V by groups, and MLA stores a compressed latent representation. Connect that shape change to KV-cache bytes, decode bandwidth, and implementation compatibility.

Variant	Cached representation	Systems tradeoff
MHA	K/V for every query head.	Largest KV cache; full head-specific representation.
MQA	One shared K/V head.	Smallest KV cache; strongest sharing constraint.
GQA	K/V shared within query-head groups.	Middle ground for cache bytes and capacity.
MLA	Compressed latent cache state.	Lower residency with architecture-specific decode support.
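The table's cache column can be turned into bytes per token with a short calculation. This is a sketch under stated assumptions: a 2-byte cache dtype, K and V both cached for the head-based variants, and a single latent vector per layer for MLA whose width (`latent_dim`) depends on the specific architecture. The model shape used here is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """K and V cached per layer for n_kv_heads heads (MHA, GQA, MQA)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def mla_cache_bytes_per_token(n_layers, latent_dim, dtype_bytes=2):
    """One compressed latent vector cached per layer (MLA-style)."""
    return n_layers * latent_dim * dtype_bytes


# Hypothetical model: 32 layers, 32 query heads, head_dim 128.
n_layers, n_q_heads, head_dim = 32, 32, 128
per_token = {
    "MHA": kv_cache_bytes_per_token(n_layers, n_q_heads, head_dim),
    "GQA (8 KV heads)": kv_cache_bytes_per_token(n_layers, 8, head_dim),
    "MQA": kv_cache_bytes_per_token(n_layers, 1, head_dim),
    "MLA (latent 576, assumed)": mla_cache_bytes_per_token(n_layers, 576),
}
for name, b in per_token.items():
    print(f"{name}: {b / 1024:.0f} KiB per token")
```

Multiplying by context length and batch size gives resident cache bytes, and the same per-token figure is what each decode step must read, which is why the variants differ in decode bandwidth as well as capacity.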

07 How do SFT, DPO, PPO / KL, and GRPO differ in code?

Short Answer

SFT uses shifted next-token labels and cross entropy. DPO compares chosen-versus-rejected log-probability margins against a frozen reference. PPO-style training uses sampled rewards with policy-control terms such as KL, while GRPO derives relative advantages within groups of sampled responses. Name the data contract before writing the loss.

Objective	Data contract	Code-level distinction
SFT	Prompt and target-token sequence.	Shifted labels with cross-entropy loss.
DPO	Chosen/rejected response pair and reference model.	Optimizes a preference margin against reference log-probabilities.
PPO / KL	Sampled rollout, reward, and reference control.	Policy update with reward/advantage and KL constraint.
GRPO	Group of sampled responses for one prompt.	Computes relative advantages within the group.
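The differences in the table show up in the function signatures. The sketch below works on plain floats (sequence or token log-probabilities that a model would supply) so that the data contract for each objective is visible. It is an illustration rather than a training loop: the function names are hypothetical, the PPO term is the standard clipped surrogate for a single sample, and any KL penalty against the reference is left out.

```python
import math


def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over target tokens.

    token_logprobs[t] is log p(label_t | prefix), with labels already shifted
    by one position; loss_mask is 1 on target tokens, 0 on prompt tokens.
    """
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(picked) / len(picked)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), margin measured against a frozen reference."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


def ppo_clipped_term(ratio, advantage, clip=0.2):
    """Clipped surrogate for one sample; the objective maximizes its mean."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Reading the arguments aloud is a quick way to name the data contract: SFT needs labels and a mask, DPO needs four log-probabilities per pair, PPO needs a probability ratio and an advantage, and GRPO needs only the rewards of one prompt's sampled group.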

08 How do ordinary decoding strategies differ from speculative decoding?

Short Answer

Greedy, temperature, top-k, top-p, and beam search define how logits become selected tokens. Speculative decoding is an execution optimization that proposes tokens and verifies them under a target-policy correctness rule. Do not present nucleus sampling as draft verification.

Technique class	What changes	Correctness or quality implication
Greedy / sampling / beam	The selection policy applied to target-model logits.	Changes diversity or search behavior.
Speculative decoding	Execution path: draft proposals verified by the target model.	Must preserve the target policy while measuring acceptance and overhead.
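The two rows of the table correspond to two different kinds of function. In the sketch below, `top_p_filter` is a selection policy that reshapes one distribution, while `speculative_accept` and `residual_distribution` form the usual accept-or-resample rule that compares target and draft probabilities so the emitted token still follows the target distribution. The names are hypothetical, and the inputs are assumed to be already-normalized probability lists over the same vocabulary.

```python
import random


def top_p_filter(probs, p):
    """Selection policy: keep the smallest high-probability set with mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass  # renormalize over the kept tokens
    return out


def speculative_accept(token, target_probs, draft_probs, rng):
    """Verification rule: accept a drafted token with prob min(1, p_target / p_draft).

    The token was sampled from the draft, so draft_probs[token] > 0.
    """
    ratio = target_probs[token] / draft_probs[token]
    return rng.random() < min(1.0, ratio)


def residual_distribution(target_probs, draft_probs):
    """On rejection, resample from normalized max(0, p_target - p_draft).

    Only called after a rejection, which implies the two distributions
    differ, so the normalizer is positive.
    """
    resid = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    z = sum(resid)
    return [r / z for r in resid]
```

If the deployment samples with top-p, the filter is applied to the target and draft distributions before they reach the acceptance rule. The filter changes which policy is being preserved; it does not do any verifying itself.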

09 Why are LoRA and MoE not interchangeable extensions?

Short Answer

LoRA learns a low-rank delta on selected frozen weights, reducing trainable adapter state and supporting a merge-equivalence check. MoE changes forward execution by routing each token to selected expert FFNs, creating capacity, load-balance, and communication tradeoffs.

Extension	What changes	Primary systems cost
LoRA	Adds trained low-rank weight deltas.	Adapter residency, switching, or merge policy.
MoE	Routes tokens through selected expert FFNs.	Expert capacity, balancing, and all-to-all traffic.
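A small sketch makes the contrast concrete. The LoRA half checks merge equivalence: applying the adapter at runtime and folding it into the weight give the same output. The MoE half routes each token to its top-k experts and counts per-expert load, which is the quantity that capacity limits and balancing losses act on. The row-vector convention (`y = x W`), the softmax over only the selected experts, and all function names are assumptions for illustration.

```python
import math


def matmul(a, b):
    """[n,k] @ [k,m] -> [n,m] on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, scale):
    """Unmerged path: y = x W + scale * (x A) B, with W frozen."""
    base, delta = matmul(x, w), matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(rb, rd)] for rb, rd in zip(base, delta)]


def lora_merge(w, a, b, scale):
    """Merged weight: W' = W + scale * (A B). Same shape as W."""
    ab = matmul(a, b)
    return [[p + scale * q for p, q in zip(rw, rab)] for rw, rab in zip(w, ab)]


def route_top_k(router_logits, k):
    """Per token: top-k expert ids with softmax weights over the chosen experts."""
    routes = []
    for logits in router_logits:
        top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
        m = max(logits[e] for e in top)
        w = [math.exp(logits[e] - m) for e in top]
        z = sum(w)
        routes.append([(e, wi / z) for e, wi in zip(top, w)])
    return routes


def expert_load(routes, n_experts):
    """Tokens assigned to each expert; imbalance drives capacity and traffic."""
    load = [0] * n_experts
    for token_routes in routes:
        for e, _ in token_routes:
            load[e] += 1
    return load


# Merge-equivalence check on a tiny rank-1 adapter.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a, b, scale = [[1.0], [1.0]], [[0.5, -0.5]], 2.0
assert lora_forward(x, w, a, b, scale) == matmul(x, lora_merge(w, a, b, scale))
```

LoRA leaves the forward graph unchanged once merged, so its cost is about which adapters are resident and when to merge. MoE changes which weights each token touches, so its cost shows up as load imbalance and cross-device token exchange.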

InfraLens

Coding and Estimation Interview Practice
