attention -> decoder + KV cache -> losses -> decoding policy -> LoRA / MoE -> implement, test, explain
LLM Whiteboard Practice
Use this sprint when an interview prompt starts with a blank editor and an LLM mechanism. The target answer is not a full training stack. It is an explicit tensor or probability contract, compact implementation logic, and a validation check you can say out loud.
QKV source
Attention(Q, K, V) = softmax(Q K^T / sqrt(Dh) + M) V
M is 0 where a key is visible and -inf where it is blocked, so future keys receive zero weight after softmax; Dh = D / H for ordinary MHA.
| Prompt shape | What to write first | Whiteboard check |
|---|---|---|
| Self-attention | Q, K, and V all project from the same hidden state X[B,S,D]. | You can name the input tensor for every projection. |
| Cross-attention | Q comes from the decoder state; K/V come from encoder or modality memory. | You do not accidentally reuse decoder K/V. |
| Multimodal bridge | Text queries may read visual K/V after projection into a compatible hidden dimension. | You state the bridge dimension before attention. |
Self-attention shape ledger
| Step | Tensor | Shape to say out loud |
|---|---|---|
| Input | X | [B, S, D] |
| Project and split heads | Q, K, V | [B, H, S, Dh] |
| Scores and mask | Q @ K.transpose(-2, -1) | [B, H, S, S] |
| Weighted values | softmax(scores) @ V | [B, H, S, Dh] |
| Merge and output | context.transpose(...).reshape(...) | [B, S, D] |
MHA, MQA, GQA, and MLA
| Variant | KV organization | Interview implication |
|---|---|---|
| MHA | KV heads match query heads. | Largest standard KV cache; easiest shape story. |
| MQA | One KV head is shared across query heads. | Small cache, but implementation must repeat or broadcast K/V for attention. |
| GQA | Groups of query heads share each KV head. | Middle ground used to reduce decode bandwidth and memory. |
| MLA | Cache stores a compressed latent representation plus architecture-specific reconstruction. | Do not describe it as only fewer KV heads; the cached representation changes. |
KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element
MQA and GQA reduce H_kv; MLA changes what representation is cached, so do not present it as a head-count edit alone.
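As a rough illustration of the byte formula above, the sketch below compares MHA, GQA, and MQA for one hypothetical configuration (32 layers, 32 query heads, Dh = 128, fp16, one request holding 4096 tokens). The configuration and the function name `kv_cache_bytes` are assumptions made for the example, not values from a specific model.

```python
def kv_cache_bytes(batch, tokens, layers, kv_heads, head_dim, bytes_per_element):
    """KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element."""
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_element

# Hypothetical configuration: only H_kv changes between the three variants.
shared = dict(batch=1, tokens=4096, layers=32, head_dim=128, bytes_per_element=2)
mha = kv_cache_bytes(kv_heads=32, **shared)  # KV heads match the 32 query heads
gqa = kv_cache_bytes(kv_heads=8, **shared)   # 4 query heads share each KV head
mqa = kv_cache_bytes(kv_heads=1, **shared)   # one KV head for all query heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# prints:
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

MLA is left out on purpose: its cache holds a latent representation, so plugging a smaller H_kv into this function would misdescribe it.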
RoPE, RMSNorm, SwiGLU, and the causal mask
RoPE
Rotate Q/K feature pairs before the dot product. In a whiteboard answer, state that RoPE changes positional geometry, not the KV-cache byte formula by itself.
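A minimal sketch of the rotation, assuming the interleaved-pair convention (features 2i and 2i+1 form one pair) and a frequency base of 10000; some implementations pair the first and second halves of the head dimension instead. It checks the property worth saying out loud: after rotation, the Q/K dot product depends on the relative offset between positions, not on the absolute positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # pair i/2 uses frequency base^(-i/d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
assert abs(near - far) < 1e-9

# Rotation preserves vector length, and the rotated vector has the same size,
# which is why RoPE alone does not change the KV-cache byte formula.
assert abs(dot(rope(k, 7), rope(k, 7)) - dot(k, k)) < 1e-9
```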
RMSNorm
Normalize by root-mean-square magnitude over the hidden dimension, then apply a learned scale. For the sprint, it is enough to show that the output keeps shape [B,S,D].
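The normalization can be sketched for a single position's hidden vector as below. The `eps` value and the plain-list representation are assumptions for illustration; a tensor implementation would apply the same rule along the last axis of [B,S,D].

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Normalize one hidden vector by its RMS, then apply a learned scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [3.0, -4.0, 0.0, 0.0]            # one position of a [B,S,D] tensor, D = 4
y = rms_norm(x, weight=[1.0] * 4)    # unit scale, so the output RMS is close to 1

out_rms = math.sqrt(sum(v * v for v in y) / len(y))
assert len(y) == len(x)              # hidden size is preserved
assert abs(out_rms - 1.0) < 1e-3
```

Unlike LayerNorm, there is no mean subtraction and no bias term in this sketch.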
SwiGLU
Project into gate and value branches, multiply silu(gate) by the value branch, then project back to the model dimension.
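A small sketch of the three projections, using nested lists and a hypothetical inner width F = 3 for a model dimension D = 2. The weight names `W_gate`, `W_value`, and `W_down` and the bias-free layout are assumptions for the example.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """W is [out, in] as nested lists; returns W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_value, W_down):
    gate = matvec(W_gate, x)                           # [D] -> [F]
    value = matvec(W_value, x)                         # [D] -> [F]
    hidden = [silu(g) * v for g, v in zip(gate, value)]
    return matvec(W_down, hidden)                      # [F] -> [D]

# Hypothetical tiny weights: D = 2, F = 3.
W_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
W_value = [[0.2, 0.7], [-0.5, 0.1], [0.6, -0.3]]
W_down = [[0.3, -0.1, 0.2], [0.4, 0.5, -0.6]]

y = swiglu([1.0, 2.0], W_gate, W_value, W_down)
assert len(y) == 2   # back to the model dimension
```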
Causal mask
Apply the keep/block rule before softmax. The reference example includes a no-leak smoke test: changing future tokens must not change prefix outputs.
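One way to sketch a no-leak check is shown below for a single head, with Q, K, and V all set to the raw inputs (identity projections, an assumption made to keep the example short). This is an illustrative version, not the reference example itself.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention over [S, Dh] lists; query t sees keys 0..t only."""
    S, Dh = len(Q), len(Q[0])
    out = []
    for t in range(S):
        scores = []
        for s in range(S):
            if s > t:
                scores.append(float("-inf"))   # block future keys before softmax
            else:
                scores.append(sum(a * b for a, b in zip(Q[t], K[s])) / math.sqrt(Dh))
        w = softmax(scores)
        out.append([sum(w[s] * V[s][j] for s in range(S)) for j in range(Dh)])
    return out

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

X = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.3], [-0.7, 0.9]]
X_changed = X[:3] + [[9.0, -9.0]]          # only the last (future) token differs

base = causal_attention(X, X, X)
changed = causal_attention(X_changed, X_changed, X_changed)

assert max_abs_diff(base[:3], changed[:3]) < 1e-12   # prefix outputs unchanged
assert max_abs_diff(base[3:], changed[3:]) > 0.0     # the changed position does move
```

If the mask were applied after softmax, or with the comparison reversed, the first assertion would fail.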
Decoder block flow
x: [B,S,D]
x = x + CausalSelfAttention(RMSNorm(x))
x = x + SwiGLU(RMSNorm(x))
assert x.shape == (B, S, D)
Keep the block pre-norm in the explanation: each sublayer normalizes its input, and the residual adds the sublayer output back onto the un-normalized x. The sprint implementation intentionally omits training-only extras such as dropout so the shape and masking invariants stay visible.
KV-cache decode
KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element
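Before making a capacity claim, write down what each factor represents: B is the number of active requests, T is the retained prompt-plus-output tokens per request, H_kv is the KV head count, and bytes_per_element follows from the cache dtype. The leading 2 counts K and V.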
| Phase | State change | System caveat |
|---|---|---|
| Prefill | Process the prompt and create K/V for all prompt tokens. | Attention work is full-sequence and often compute-heavy. |
| Decode | Append one token's K/V and read retained history for the new query. | Decode often becomes cache-read and memory-bandwidth-heavy. |
| Validation | Compare full causal attention with step-by-step cached decode. | Cache equivalence must hold before optimizing layout. |
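The validation row can be sketched as follows for one head, assuming Q, K, and V are already projected per-token vectors. The helper names `full_causal` and `decode_step` are hypothetical. Prefill fills the cache from the prompt, then each decode step appends one K/V pair and attends over the retained history.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_causal(Q, K, V):
    """Reference path: masked attention over the whole sequence at once."""
    S, d = len(Q), len(Q[0])
    out = []
    for t in range(S):
        scores = [dot(Q[t], K[s]) / math.sqrt(d) if s <= t else float("-inf")
                  for s in range(S)]
        w = softmax(scores)
        out.append([sum(w[s] * V[s][j] for s in range(S)) for j in range(d)])
    return out

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """Append one token's K/V, then read the retained history for the new query."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    d = len(q)
    w = softmax([dot(q, k) / math.sqrt(d) for k in k_cache])
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(d)]

Q = [[0.1, 0.4], [0.9, -0.3], [-0.5, 0.2], [0.7, 0.7], [0.0, -1.0]]
K = [[0.3, 0.1], [-0.2, 0.8], [0.6, -0.4], [0.05, 0.5], [-0.9, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [2.0, -1.0]]
prompt_len = 3

reference = full_causal(Q, K, V)

# Prefill: full-sequence attention over the prompt, and K/V stored for every prompt token.
cached = full_causal(Q[:prompt_len], K[:prompt_len], V[:prompt_len])
k_cache, v_cache = list(K[:prompt_len]), list(V[:prompt_len])

# Decode: one token at a time against the growing cache.
for t in range(prompt_len, len(Q)):
    cached.append(decode_step(Q[t], K[t], V[t], k_cache, v_cache))

max_diff = max(abs(a - b) for r, c in zip(reference, cached) for a, b in zip(r, c))
assert max_diff < 1e-12
```

No mask appears in `decode_step` because the cache only ever contains past and current tokens; the causal rule is enforced by what has been appended.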
SFT, DPO, PPO / KL, and GRPO
Do not treat all post-training losses as interchangeable code. Start by naming whether the input is shifted language-model labels, preference pairs, or sampled reward-bearing responses.
SFT: CE(logits[:, :-1], tokens[:, 1:])
DPO: -log sigmoid(beta * ((log pi_chosen - log pi_rejected) - (log ref_chosen - log ref_rejected)))
GRPO: advantage_i = (reward_i - mean(group_rewards)) / std(group_rewards)
| Objective | State and core operation | Whiteboard validation |
|---|---|---|
| SFT / cross entropy | Shift tokens: input_ids = tokens[:, :-1], labels = tokens[:, 1:]; score one vocabulary distribution per position. | Assert shifted lengths and ignore padded label positions. |
| DPO | Use chosen/rejected policy log-probability margin minus the frozen reference margin inside a logistic loss. | A stronger chosen policy margin should lower loss. |
| PPO with KL control | Optimize sampled rewards while penalizing drift from a reference policy; rollout state matters. | Name reward, reference logprobs, and KL monitoring. |
| GRPO | Normalize verifier or reward values within a response group for the same prompt. | Group-relative advantages should be centered within each group. |
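The three formulas above can be sketched on plain Python numbers as below. Several details are assumptions for illustration: the `ignore_index` of -100 (a common padding-label convention), `beta=0.1`, the population standard deviation, and the small `eps` added to the GRPO denominator so that a group with identical rewards does not divide by zero. PPO is omitted because it needs rollout state that does not fit a short example.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sft_loss(logits, tokens, ignore_index=-100):
    """logits: [S, V] for one sequence; tokens: [S]. Position t predicts tokens[t + 1]."""
    preds, labels = logits[:-1], tokens[1:]
    assert len(preds) == len(labels)
    losses = [-log_softmax(p)[y] for p, y in zip(preds, labels) if y != ignore_index]
    return sum(losses) / len(losses)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed sequence log-probabilities under policy and frozen reference."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))

def grpo_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one prompt's response group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# SFT: a uniform predictor over V = 4 tokens costs ln(4) per scored position.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
assert abs(sft_loss(uniform, [1, 2, 3]) - math.log(4)) < 1e-12

# DPO: a stronger chosen-vs-rejected policy margin lowers the loss.
assert dpo_loss(-5.0, -9.0, -7.0, -8.0) < dpo_loss(-7.0, -8.0, -7.0, -8.0)

# GRPO: advantages are centered within the group.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9
```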
Greedy, temperature, top-k, top-p, and beam search
p_T(i) = exp(z_i / T) / sum_j exp(z_j / T)
top-k: keep the k largest logits; top-p: keep the sorted prefix whose cumulative mass first reaches p
beam_score(y_1:t) = sum_u log p(y_u | y_<u, x)
Temperature changes sharpness; top-k fixes candidate count; top-p adapts candidate count to distribution mass; beam search accumulates sequence scores.
| Policy | Implementation move | Failure or tradeoff to mention |
|---|---|---|
| Greedy | argmax(logits) | Deterministic and cheap, but can be repetitive or locally shortsighted. |
| Temperature | Divide logits by T before softmax. | Low T sharpens; high T increases randomness. |
| Top-k / top-p | Mask unwanted logits to negative infinity, then sample. | Fixed count and probability mass are different policies. |
| Beam search | Retain highest-scoring partial sequences at each step. | Adds branching/cache work and is not universally best for open-ended text. |
These rules choose output tokens from model logits. Speculative decoding is a different kind of mechanism: a draft model proposes tokens and the target model verifies them, with an acceptance rule designed to preserve the target model's output distribution.
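The temperature, top-k, and top-p rules can be sketched on a single logit vector as below. The example logits and helper names are hypothetical, and beam search is left out because it needs per-hypothesis state. The closing check is the one named in this sprint: filtered probabilities must still normalize to 1.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, T):
    assert T > 0
    return [z / T for z in logits]

def top_k_filter(logits, k):
    """Keep the k largest logits; mask the rest to -inf."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [z if i in keep else float("-inf") for i, z in enumerate(logits)]

def top_p_filter(logits, p):
    """Keep the sorted prefix whose cumulative mass first reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [z if i in keep else float("-inf") for i, z in enumerate(logits)]

def sample(logits, rng):
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
greedy = max(range(len(logits)), key=lambda i: logits[i])   # argmax

filtered = top_p_filter(apply_temperature(logits, 0.7), p=0.9)
assert abs(sum(softmax(filtered)) - 1.0) < 1e-12   # still a valid distribution

rng = random.Random(0)                              # seeded for repeatability
token = sample(top_k_filter(logits, 2), rng)
assert token in (0, 1)                              # masked tokens are never drawn
```

Filtering happens on logits before the final softmax, so the surviving candidates are renormalized automatically.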
LoRA adaptation and MoE routing
| Extension | Minimal code contract | Do not confuse it with |
|---|---|---|
| LoRA | Freeze W0; add (alpha / r) * B(A(x)); test that merged W0 + (alpha / r) * B @ A matches. | A smaller base model or conditional expert routing. |
| MoE | Compute router probabilities, choose top-k experts per token, run selected FFNs, and combine by gate weights. | Parameter-efficient fine-tuning; MoE changes per-token execution. |
W' = W0 + (alpha / r) * B @ A
y = x @ W0.T + (alpha / r) * x @ A.T @ B.T
With A of shape [r, d_in] and B of shape [d_out, r], rank(delta_W) <= r and the trainable parameter count is r * (d_in + d_out) for one bias-free adapted projection.
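A sketch of the merged-weight equivalence test on tiny nested-list matrices, assuming W0 is [d_out, d_in], A is [r, d_in], and B is [d_out, r]. The specific numbers and helper names are made up for the example.

```python
def matmul(A, B):
    """[n, m] @ [m, p] over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lora_forward(x, W0, A, B, alpha, r):
    """y = x @ W0.T + (alpha / r) * x @ A.T @ B.T, with W0 kept frozen."""
    base = matmul(x, transpose(W0))
    delta = matmul(matmul(x, transpose(A)), transpose(B))
    s = alpha / r
    return [[b + s * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

def merge(W0, A, B, alpha, r):
    """W' = W0 + (alpha / r) * B @ A"""
    BA = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W0, BA)]

# Hypothetical sizes: d_in = 3, d_out = 2, r = 1.
W0 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]   # [d_out, d_in]
A = [[0.1, 0.2, -0.3]]                       # [r, d_in]
B = [[0.5], [-1.0]]                          # [d_out, r]
x = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]      # [N, d_in]
alpha, r = 2.0, 1

adapter_out = lora_forward(x, W0, A, B, alpha, r)
merged_out = matmul(x, transpose(merge(W0, A, B, alpha, r)))

max_diff = max(abs(a - b) for ra, rb in zip(adapter_out, merged_out)
               for a, b in zip(ra, rb))
assert max_diff < 1e-12

trainable = r * (3 + 2)                      # r * (d_in + d_out)
assert trainable == len(A) * len(A[0]) + len(B) * len(B[0])
```

A common merge bug is multiplying A @ B instead of B @ A; with these shapes that product would not even match W0, which is a quick shape check to mention.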
y_token = sum_{e in top_k(router(x))} gate_e(x) * Expert_e(x)
Report active experts per token, capacity or dropped-token behavior, and whether expert traffic crosses devices.
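The routing formula can be sketched per token as below. Experts are hypothetical Python callables that simply scale their input, and the gate weights are renormalized over the selected experts, which is one common choice rather than the only one. The sketch also tallies expert load, ignoring capacity limits and dropped tokens.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_token(x, router_W, experts, k):
    """Route one token: pick top-k experts by router probability, combine by gate weight."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_W]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)       # assumption: renormalize over selected experts
    y = [0.0] * len(x)
    for e in top:
        gate = probs[e] / norm
        out = experts[e](x)                 # only the selected FFNs run
        y = [a + gate * b for a, b in zip(y, out)]
    return y, top

# Four toy experts that scale the input by 1, 2, 3, and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # [num_experts, D]
tokens = [[2.0, 1.0], [-1.0, 3.0], [0.5, -2.0]]

load = [0] * len(experts)
for x in tokens:
    y, top = moe_token(x, router_W, experts, k=2)
    for e in top:
        load[e] += 1

assert sum(load) == len(tokens) * 2         # each token activates exactly k experts
print(load)                                 # per-expert token counts
```

Uneven counts in `load` are the starting point for a load-balance discussion; in a multi-device setup each count also implies tokens dispatched to wherever that expert lives.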
For MoE, conclude with load balance and expert-parallel communication; sparse active compute is not free capacity. For LoRA, state target modules and whether adapters are merged or dynamically served.
How to say the answer out loud
- Name the state. Say whether you are transforming hidden states, cache tensors, logits, preference logprobs, adapter weights, or routed tokens.
- State the contract. Give the tensor shape, probability rule, loss inputs, or trainable-parameter formula before coding.
- Write the decisive operation. Apply the causal mask before softmax, shift SFT labels, filter logits before sampling, merge a LoRA delta, or top-k route experts.
- Name one failure. Examples are future leakage, cache-shape confusion, reward/reference mix-ups, over-broad sampling, incorrect LoRA merge, or expert imbalance.
- Connect to systems cost. Mention sequence length, KV heads, dtype bytes, beam width, rollout/reference passes, adapter state, or all-to-all traffic.
- Finish with validation. Use no-leak, cache equivalence, centered advantages, normalized filtered probabilities, merged-weight equivalence, or expert-load checks.
Topic map, not vendored content
This page is clean-room InfraLens material inspired by the public topic structure of NashKnight/LLM-Whiteboard. External Markdown and PDF content are not copied into this repository.
