Implementation Practice

Coding and Estimation Interview Practice

Write the tensor or resource ledger before reaching for a framework API or optimization name.

Q&A Cards

01. How do you estimate per-rank training state under sharding?

Short Answer

Account separately for parameter bytes, gradient bytes, and optimizer-state bytes. Divide by the data-parallel degree only the state that is actually sharded under the chosen ZeRO stage or FSDP strategy. State explicitly that activations and temporary buffers are excluded unless you model them separately.
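
A minimal sketch of that ledger, assuming ZeRO-style stages (1 shards optimizer state, 2 also shards gradients, 3 also shards parameters; fully sharded FSDP behaves like stage 3). The name `per_rank_state_bytes` and the default byte counts (2-byte weights and gradients, 12 bytes of optimizer state per parameter for mixed-precision Adam) are illustrative assumptions, not values from any framework.

```python
from dataclasses import dataclass


@dataclass
class StateLedger:
    """Bytes per parameter. Defaults ASSUME mixed-precision Adam."""
    param_bytes: int = 2   # bf16/fp16 working weights (assumption)
    grad_bytes: int = 2    # bf16/fp16 gradients (assumption)
    optim_bytes: int = 12  # fp32 master weights + Adam m + Adam v (assumption)


def per_rank_state_bytes(n_params, dp_degree, zero_stage, ledger=None):
    """Per-rank bytes for parameters, gradients, and optimizer state.

    Activations and temporary buffers are deliberately NOT modeled here.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    ledger = ledger or StateLedger()

    def shard(bytes_per_param, is_sharded):
        # Divide by the data-parallel degree only when this state is sharded.
        return bytes_per_param * n_params / (dp_degree if is_sharded else 1)

    parts = {
        "params": shard(ledger.param_bytes, zero_stage >= 3),
        "grads": shard(ledger.grad_bytes, zero_stage >= 2),
        "optim": shard(ledger.optim_bytes, zero_stage >= 1),
    }
    parts["total"] = sum(parts.values())
    return parts
```

Under these defaults, 7e9 parameters at data-parallel degree 8 and stage 3 come to 16 × 7e9 / 8 = 14e9 bytes per rank, before any activation or buffer memory.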

02. What assumptions belong in a communication estimate?

Short Answer

Name the payload bytes, collective type, world size, effective bandwidth, topology and contending-traffic assumptions, and overlap assumption. A bytes-over-bandwidth result is a lower bound on communication time, not a measured step time.

Source: NCCL collectives
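
As a sketch of how those assumptions become explicit inputs, the function below computes a bytes-over-bandwidth lower bound using the standard ring-algorithm traffic factors: 2(N-1)/N of the payload per rank for all-reduce, and (N-1)/N for all-gather and reduce-scatter. The function name, the `overlap_fraction` parameter, and the choice to count the payload as the full (unsharded) tensor size are assumptions made for illustration.

```python
def collective_lower_bound_s(payload_bytes, world_size, collective,
                             eff_bw_bytes_per_s, overlap_fraction=0.0):
    """Lower bound on exposed communication seconds for one collective.

    payload_bytes: full tensor size (for all_gather, the gathered size).
    eff_bw_bytes_per_s: ASSUMED effective per-rank bandwidth; topology and
        contending traffic must already be folded into this number.
    overlap_fraction: ASSUMED share of the transfer hidden behind compute.
    """
    n = world_size
    ring_factor = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
    }
    if collective not in ring_factor:
        raise ValueError(f"unknown collective: {collective}")
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")

    wire_bytes = ring_factor[collective] * payload_bytes
    transfer_s = wire_bytes / eff_bw_bytes_per_s
    # Only the non-overlapped part shows up as added step time.
    return transfer_s * (1.0 - overlap_fraction)
```

The result ignores latency, launch overhead, and stragglers, so a measured collective should be expected to take at least this long, not exactly this long.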

03. Why is an estimated speculative speedup insufficient?

Short Answer

The acceptance rate depends on workload and draft quality, so an estimate built on an assumed rate cannot stand in for measurement. A deployment measurement must include draft overhead, target verification, sampling correctness, batching divergence, and KV rollback behavior.
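
To see how sensitive the estimate is, here is a sketch of the usual idealized model: with a fixed per-token acceptance rate `alpha` (assumed independent across positions) and `gamma` drafted tokens per step, the expected number of tokens produced per target pass is (1 - alpha^(gamma+1)) / (1 - alpha). The function names and the `draft_cost_ratio` parameter (draft pass cost relative to one target pass) are illustrative assumptions, and the model leaves out every deployment effect listed above.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target verification pass.

    ASSUMES each drafted token is accepted independently with probability
    alpha; real acceptance varies by prompt and position.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Idealized speedup over plain decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a target pass. Batching divergence, KV rollback, and sampling
    overhead are NOT modeled.
    """
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

With `gamma=4` and `draft_cost_ratio=0.1`, an acceptance rate of 0.8 gives an estimate above 1, while an acceptance rate of 0 gives 1/1.4, a slowdown. The same configuration can land on either side depending on the workload.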

04. How should a system-design coding answer end?

Short Answer

End with observability and an invalidation test: which metrics verify the claimed bottleneck, which failure path is recoverable, and what result would make you change the design.

05. How would you live-code a minimal decoder block?

Short Answer

Start from x: [B,S,D]. Apply pre-norm attention with Q/K/V split into [B,H,S,Dh], mask future positions before softmax, merge heads back to [B,S,D], then add a pre-norm SwiGLU feed-forward residual. Validate with shape assertions, finite outputs, and a causal no-leak test.
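
One way the steps above might look in NumPy, as a sketch rather than a reference implementation. It assumes RMSNorm for the pre-norm, omits positional encoding and biases, and uses hypothetical helper names (`init_params`, `decoder_block`, `validate_block`) and a plain dict of weights.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def silu(x):
    return x / (1.0 + np.exp(-x))


def init_params(D, H, F, seed=0):
    """Random weights for illustration only; F is the FFN hidden width."""
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {"H": H, "norm1": np.ones(D), "norm2": np.ones(D),
            "wq": w(D, D), "wk": w(D, D), "wv": w(D, D), "wo": w(D, D),
            "w_gate": w(D, F), "w_up": w(D, F), "w_down": w(F, D)}


def decoder_block(x, p):
    B, S, D = x.shape
    H = p["H"]
    assert D % H == 0, "D must be divisible by H"
    Dh = D // H

    def split_heads(t):  # [B,S,D] -> [B,H,S,Dh]
        return t.reshape(B, S, H, Dh).transpose(0, 2, 1, 3)

    # Pre-norm causal self-attention with residual.
    h = rms_norm(x, p["norm1"])
    q, k, v = (split_heads(h @ p[name]) for name in ("wq", "wk", "wv"))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(Dh)   # [B,H,S,S]
    future = np.triu(np.ones((S, S), dtype=bool), k=1)   # True above diagonal
    scores = np.where(future, -np.inf, scores)           # mask before softmax
    ctx = softmax(scores) @ v                            # [B,H,S,Dh]
    merged = ctx.transpose(0, 2, 1, 3).reshape(B, S, D)  # back to [B,S,D]
    x = x + merged @ p["wo"]

    # Pre-norm SwiGLU feed-forward with residual.
    h = rms_norm(x, p["norm2"])
    return x + (silu(h @ p["w_gate"]) * (h @ p["w_up"])) @ p["w_down"]


def validate_block(B=2, S=5, D=8, H=2, F=16, t=3):
    """Shape, finiteness, and causal no-leak checks."""
    p = init_params(D, H, F)
    x = np.random.default_rng(1).standard_normal((B, S, D))
    y = decoder_block(x, p)
    assert y.shape == (B, S, D)
    assert np.isfinite(y).all()
    x_future = x.copy()
    x_future[:, t:] += 1.0                       # perturb positions >= t
    y_future = decoder_block(x_future, p)
    assert np.allclose(y[:, :t], y_future[:, :t])  # earlier outputs unchanged
    return True
```

The no-leak check perturbs only positions at or after `t` and asserts that outputs before `t` do not move, which fails immediately if the mask is applied after softmax or transposed.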

06. What changes between MHA, MQA, GQA, and MLA in a whiteboard answer?

Short Answer

MHA stores K/V per query head, MQA shares one K/V head across all query heads, GQA shares K/V by groups, and MLA stores a compressed latent representation. Connect that shape change to KV-cache bytes, decode bandwidth, and implementation compatibility.

| Variant | Cached representation | Systems tradeoff |
| --- | --- | --- |
| MHA | K/V for every query head. | Largest KV cache; full head-specific representation. |
| MQA | One shared K/V head. | Smallest KV cache; strongest sharing constraint. |
| GQA | K/V shared within query-head groups. | Middle ground for cache bytes and capacity. |
| MLA | Compressed latent cache state. | Lower residency with architecture-specific decode support. |
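
The shape change can be tied to cache bytes with a small calculator. This is a sketch: the function names are hypothetical, and `latent_dim` stands for whatever per-token, per-layer width an MLA-style model actually caches, which is architecture-specific.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V each hold n_kv_heads * head_dim elements per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def latent_cache_bytes_per_token(n_layers, latent_dim, bytes_per_elem=2):
    """MLA-style cache: one compressed latent vector per layer per token.

    latent_dim is an ASSUMED total cached width, not a value from any model.
    """
    return n_layers * latent_dim * bytes_per_elem


def compare_variants(n_layers, n_q_heads, head_dim, n_kv_groups, latent_dim,
                     bytes_per_elem=2):
    """Per-token cache bytes for each variant at the same model shape."""
    return {
        "MHA": kv_cache_bytes_per_token(n_layers, n_q_heads, head_dim, bytes_per_elem),
        "MQA": kv_cache_bytes_per_token(n_layers, 1, head_dim, bytes_per_elem),
        "GQA": kv_cache_bytes_per_token(n_layers, n_kv_groups, head_dim, bytes_per_elem),
        "MLA": latent_cache_bytes_per_token(n_layers, latent_dim, bytes_per_elem),
    }
```

Multiplying any entry by batch size and context length gives resident cache bytes, and the same number is roughly what each decode step must read back from memory.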

07. How do SFT, DPO, PPO / KL, and GRPO differ in code?

Short Answer

SFT uses shifted next-token labels and cross entropy. DPO compares chosen-versus-rejected log-probability margins against a frozen reference. PPO-style training uses sampled rewards with policy-control terms such as KL, while GRPO derives relative advantages within groups of sampled responses. Name the data contract before writing the loss.

| Objective | Data contract | Code-level distinction |
| --- | --- | --- |
| SFT | Prompt and target-token sequence. | Shifted labels with cross-entropy loss. |
| DPO | Chosen/rejected response pair and reference model. | Optimizes a preference margin against reference log-probabilities. |
| PPO / KL | Sampled rollout, reward, and reference control. | Policy update with reward/advantage and KL constraint. |
| GRPO | Group of sampled responses for one prompt. | Computes relative advantages within the group. |
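
The distinctions can be sketched as scalar functions over precomputed log-probabilities and rewards. Everything here is illustrative: the function names are hypothetical, prompt-token masking in the SFT loss is an assumed (common) convention, and the KL-shaped reward is one common form of KL control rather than the only one.

```python
import math


def sft_loss(step_logprobs, tokens, prompt_len):
    """Cross entropy with shifted labels.

    step_logprobs[t][v] is log p(next token = v | tokens[:t+1]), so the
    label for step t is tokens[t+1]. Prompt positions are masked (assumption).
    """
    targets = tokens[1:]
    losses = [-step_logprobs[t][tok]
              for t, tok in enumerate(targets) if t + 1 >= prompt_len]
    return sum(losses) / len(losses)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin) over sequence log-probs; reference is frozen."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


def kl_shaped_reward(reward, logp_policy, logp_ref, kl_coef=0.1):
    """Reward minus a penalty for drifting from the reference policy."""
    return reward - kl_coef * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new, logp_old, advantage, clip=0.2):
    """Clipped surrogate for one sampled action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_advantages(rewards, eps=1e-8):
    """Advantages relative to the group of responses for one prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Each signature is its data contract: `sft_loss` needs target tokens, `dpo_loss` needs a chosen/rejected pair scored by both policy and reference, the PPO helpers need a sampled rollout with a reward, and `grpo_advantages` needs several rewards for the same prompt.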

08. How do ordinary decoding strategies differ from speculative decoding?

Short Answer

Greedy, temperature, top-k, top-p, and beam search define how logits become selected tokens. Speculative decoding is an execution optimization that proposes tokens and verifies them under a target-policy correctness rule. Do not present nucleus sampling as draft verification.

| Technique class | What changes | Correctness or quality implication |
| --- | --- | --- |
| Greedy / sampling / beam | The selection policy applied to target-model logits. | Changes diversity or search behavior. |
| Speculative decoding | Execution path: draft proposals verified by the target model. | Must preserve the target policy while measuring acceptance and overhead. |
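
A side-by-side sketch makes the difference concrete. `top_p_filter` is a selection policy that reshapes one distribution, while `verify_draft_token` applies the standard speculative acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). The function names and the list-of-probabilities interface are assumptions for illustration.

```python
import random


def top_p_filter(probs, top_p):
    """Selection policy: keep the smallest top set with mass >= top_p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Execution optimization: verify one drafted token against the target.

    `token` must have been sampled from draft_probs, so draft_probs[token] > 0.
    Returns (emitted_token, accepted). The emitted token is distributed
    according to target_probs under this rule.
    """
    p, q = target_probs[token], draft_probs[token]
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the part of the target the draft under-covers.
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    total = sum(residual)
    return sample([r / total for r in residual], rng), False
```

If the target policy itself uses nucleus sampling, the filter is applied to the target (and draft) distributions first, and verification then runs on those filtered distributions. The two steps compose but do not substitute for each other.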

09. Why are LoRA and MoE not interchangeable extensions?

Short Answer

LoRA learns a low-rank delta on selected frozen weights, reducing trainable adapter state and supporting a merge-equivalence check. MoE changes forward execution by routing each token to selected expert FFNs, creating capacity, load-balance, and communication tradeoffs.

| Extension | What changes | Primary systems cost |
| --- | --- | --- |
| LoRA | Adds trained low-rank weight deltas. | Adapter residency, switching, or merge policy. |
| MoE | Routes tokens through selected expert FFNs. | Expert capacity, balancing, and all-to-all traffic. |
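
A dependency-free sketch of both ideas, using nested lists as matrices. The LoRA half shows the merge-equivalence check (adapter path versus merged weight); the MoE half shows top-k routing with a per-expert capacity. The function names, the `scale` argument (commonly alpha / rank), and the drop-on-overflow policy are assumptions chosen for illustration.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, W, A, B, scale):
    """Frozen W plus a low-rank delta: x @ W + scale * (x @ A) @ B."""
    return add_scaled(matmul(x, W), matmul(matmul(x, A), B), scale)


def merge_lora(W, A, B, scale):
    """Fold the adapter into the weight: W + scale * A @ B."""
    return add_scaled(W, matmul(A, B), scale)


def route_top_k(router_logits, k, capacity):
    """Send each token to its top-k experts, dropping overflow (one policy).

    Returns (assignments, load): the experts kept for each token and the
    number of tokens each expert received.
    """
    n_experts = len(router_logits[0])
    load = [0] * n_experts
    assignments = []
    for logits in router_logits:
        ranked = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)
        kept = []
        for e in ranked[:k]:
            if load[e] < capacity:   # expert still has room
                load[e] += 1
                kept.append(e)
        assignments.append(kept)
    return assignments, load
```

The LoRA functions leave the forward path dense and let you assert `lora_forward(x, W, A, B, s) == matmul(x, merge_lora(W, A, B, s))` up to floating-point tolerance. The router instead changes which weights each token touches, so its output is a load vector that has to be balanced and, across devices, exchanged.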