LLM Whiteboard Practice

LLM Live Coding

Use this sprint when an interview prompt starts with a blank editor and an LLM mechanism. The target answer is not a full training stack. It is an explicit tensor or probability contract, compact implementation logic, and a validation check you can say out loud.

Primary path

attention -> decoder + KV cache -> losses
       -> decoding policy -> LoRA / MoE
       -> implement, test, explain

Open next

Runnable primitive example, Decoding strategies example, LoRA example, Attention concept, KV Cache concept

Module 01

QKV source

Scaled dot-product attention

Attention(Q, K, V) = softmax(Q K^T / sqrt(Dh) + M) V

M is an additive mask, 0 where a key is visible and -inf where it is blocked, so future keys receive zero weight after softmax; Dh = D / H for ordinary MHA.

Prompt shape	What to write first	Whiteboard check
Self-attention	Q, K, and V all project from the same hidden state `X[B,S,D]`.	You can name the input tensor for every projection.
Cross-attention	Q comes from the decoder state; K/V come from encoder or modality memory.	You do not accidentally reuse decoder K/V.
Multimodal bridge	Text queries may read visual K/V after projection into a compatible hidden dimension.	You state the bridge dimension before attention.

Self-attention shape ledger

Step	Tensor	Shape to say out loud
Input	`X`	`[B, S, D]`
Project and split heads	`Q, K, V`	`[B, H, S, Dh]`
Scores and mask	`Q @ K.transpose(-2, -1)`	`[B, H, S, S]`
Weighted values	`softmax(scores) @ V`	`[B, H, S, Dh]`
Merge and output	`context.transpose(...).reshape(...)`	`[B, S, D]`
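
The ledger above can be walked through in code. This is a minimal NumPy sketch rather than a reference implementation: the weight names `Wq`, `Wk`, `Wv`, `Wo` are hypothetical, biases are omitted, and a real answer would likely use a framework's tensor type instead.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv, Wo, H):
    B, S, D = X.shape
    Dh = D // H

    def split_heads(t):
        # [B, S, D] -> [B, H, S, Dh]
        return t.reshape(B, S, H, Dh).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    assert scores.shape == (B, H, S, S)
    # Keep keys at or before the query position; block the rest with -inf.
    keep = np.tril(np.ones((S, S), dtype=bool))
    weights = softmax(np.where(keep, scores, -np.inf))
    context = weights @ V
    assert context.shape == (B, H, S, Dh)
    # Merge heads: [B, H, S, Dh] -> [B, S, D], then apply the output projection.
    merged = context.transpose(0, 2, 1, 3).reshape(B, S, D)
    return merged @ Wo

rng = np.random.default_rng(0)
B, S, D, H = 2, 5, 8, 2
X = rng.normal(size=(B, S, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))
out = causal_self_attention(X, Wq, Wk, Wv, Wo, H)
assert out.shape == (B, S, D)
```

The inline shape assertions mirror the "shape to say out loud" column, which makes a transposed or unsplit tensor fail early instead of silently broadcasting.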

Module 02

MHA, MQA, GQA, and MLA

Variant	KV organization	Interview implication
MHA	KV heads match query heads.	Largest standard KV cache; easiest shape story.
MQA	One KV head is shared across query heads.	Small cache, but implementation must repeat or broadcast K/V for attention.
GQA	Groups of query heads share each KV head.	Middle ground used to reduce decode bandwidth and memory.
MLA	Cache stores a compressed latent representation plus architecture-specific reconstruction.	Do not describe it as only fewer KV heads; the cached representation changes.

Cache term affected by the variant

KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element

MQA and GQA reduce H_kv; MLA changes what representation is cached, so do not present it as a head-count edit alone.
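
To make the head-count effect concrete, the formula can be evaluated for a hypothetical configuration. The numbers below (32 layers, head dimension 128, 2-byte elements, 4096 retained tokens, one request) are illustrative assumptions, not a specific model. MLA is left out because its cached representation does not fit this formula.

```python
def kv_cache_bytes(batch, tokens, layers, kv_heads, head_dim, bytes_per_element):
    # The leading 2 accounts for storing both K and V.
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_element

# Hypothetical configuration shared by all three variants.
cfg = dict(batch=1, tokens=4096, layers=32, head_dim=128, bytes_per_element=2)

mha = kv_cache_bytes(kv_heads=32, **cfg)  # KV heads match 32 query heads
gqa = kv_cache_bytes(kv_heads=8, **cfg)   # 4 query heads share each KV head
mqa = kv_cache_bytes(kv_heads=1, **cfg)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks linearly with `H_kv`: 2048 MiB for MHA, 512 MiB for GQA, and 64 MiB for MQA.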

Module 03

RoPE, RMSNorm, SwiGLU, and the causal mask

RoPE

Rotate Q/K feature pairs before the dot product. In a whiteboard answer, state that RoPE changes positional geometry, not the KV-cache byte formula by itself.
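
A small sketch can back up the "positional geometry" claim: after rotation, the Q/K dot product depends on the relative offset between positions, not on their absolute values. The pairing convention here (adjacent features `2i` and `2i+1`) and the base of 10000 are assumptions; some codebases pair the two halves of the head dimension instead.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: [S, Dh] with even Dh; positions: [S] integer token positions.
    S, Dh = x.shape
    inv_freq = base ** (-np.arange(0, Dh, 2) / Dh)   # [Dh/2]
    angles = positions[:, None] * inv_freq[None, :]  # [S, Dh/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) feature pair by its position-dependent angle.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (3) at two different absolute positions.
s_near = (rope(q, np.array([5])) @ rope(k, np.array([2])).T).item()
s_far = (rope(q, np.array([105])) @ rope(k, np.array([102])).T).item()
assert np.isclose(s_near, s_far)
```

The output shape equals the input shape, which is why RoPE alone leaves the KV-cache byte formula unchanged.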

RMSNorm

Normalize by root-mean-square magnitude over the hidden dimension, then apply a learned scale. For this sprint, it is enough to state that the output keeps the [B,S,D] shape.
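
A minimal NumPy sketch of that description, assuming a small `eps` for numerical safety (the value 1e-6 is an illustrative choice):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # x: [B, S, D]; weight: [D] learned scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))
y = rms_norm(x, weight=np.ones(8))
assert y.shape == x.shape
```

Unlike LayerNorm, there is no mean subtraction and no bias in this form.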

SwiGLU

Project into gate and value branches, multiply silu(gate) by the value branch, then project back to the model dimension.
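
The three projections can be sketched as follows. The weight names and the intermediate width `F` are hypothetical, and biases are omitted.

```python
import numpy as np

def silu(x):
    # silu(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W_gate, W_up, W_down):
    # x: [B, S, D]; W_gate, W_up: [D, F]; W_down: [F, D]
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
D, F = 8, 16
x = rng.normal(size=(2, 3, D))
W_gate, W_up = rng.normal(size=(D, F)), rng.normal(size=(D, F))
W_down = rng.normal(size=(F, D))
ffn_out = swiglu(x, W_gate, W_up, W_down)
assert ffn_out.shape == x.shape
```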

Causal mask

Apply the keep/block rule before softmax. The reference example includes a no-leak smoke test: changing future tokens must not change prefix outputs.
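
A no-leak smoke test along those lines might look like this. It is a single-head sketch with Q = K = V = x so the mask is the only thing under test; it is not the reference example itself.

```python
import numpy as np

def causal_attention(x):
    # x: [S, D]; Q = K = V = x keeps the focus on the mask.
    S, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    keep = np.tril(np.ones((S, S), dtype=bool))   # keep keys at or before the query
    scores = np.where(keep, scores, -np.inf)      # block before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
x_changed = x.copy()
x_changed[4:] = rng.normal(size=(2, 4))   # perturb only positions 4 and 5

# Outputs for positions 0..3 must not see the change.
prefix_same = np.allclose(causal_attention(x)[:4], causal_attention(x_changed)[:4])
assert prefix_same
```

If the mask were applied after softmax, or with the triangle flipped, the assertion would fail, which is the point of the check.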

Module 04

Decoder block flow

x: [B,S,D]
x = x + CausalSelfAttention(RMSNorm(x))
x = x + SwiGLU(RMSNorm(x))
assert x.shape == (B, S, D)

Describe the block as pre-norm: each sublayer reads RMSNorm(x), and its output is added back to the un-normalized residual stream. The sprint implementation intentionally omits training-only extras such as dropout so the shape and masking invariants stay visible.

Module 05

KV-cache decode

Memory ledger

KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element

Before making a capacity claim, write down what each factor stands for: B is active requests, T is retained prompt-plus-output tokens per request, H_kv is KV heads, and bytes_per_element follows from the dtype.

Phase	State change	System caveat
Prefill	Process the prompt and create K/V for all prompt tokens.	Attention work is full-sequence and often compute-heavy.
Decode	Append one token's K/V and read retained history for the new query.	Decode often becomes cache-read and memory-bandwidth-heavy.
Validation	Compare full causal attention with step-by-step cached decode.	Cache equivalence must hold before optimizing layout.
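
The validation row can be sketched as a direct comparison. This assumes a single head, a single layer, and no batch dimension, and it uses Python lists as a stand-in for a preallocated cache.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal(X, Wq, Wk, Wv):
    # Prefill-style: all positions at once with a causal mask.
    S, D = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(D)
    keep = np.tril(np.ones((S, S), dtype=bool))
    return softmax(np.where(keep, scores, -np.inf)) @ V

def cached_decode(X, Wq, Wk, Wv):
    # Decode-style: append one token's K/V, then attend over retained history.
    D = X.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for x_t in X:
        k_cache.append(x_t @ Wk)
        v_cache.append(x_t @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)   # [t + 1, D]
        w = softmax(K @ (x_t @ Wq) / np.sqrt(D))      # [t + 1]
        outputs.append(w @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
S, D = 5, 4
X = rng.normal(size=(S, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
full = full_causal(X, Wq, Wk, Wv)
cached = cached_decode(X, Wq, Wk, Wv)
assert np.allclose(full, cached)
```

No mask is needed in the cached path because the cache only ever contains past and current tokens.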

Module 06

SFT, DPO, PPO / KL, and GRPO

Do not treat all post-training losses as interchangeable code. Start by naming whether the input is shifted language-model labels, preference pairs, or sampled reward-bearing responses.

Loss contracts to write before code

SFT: CE(logits[:, :-1], tokens[:, 1:])

DPO: -log sigmoid(beta * ((log pi_chosen - log pi_rejected) - (log ref_chosen - log ref_rejected)))

GRPO: advantage_i = (reward_i - mean(group_rewards)) / std(group_rewards)
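
The DPO and GRPO contracts translate almost line for line into code. This sketch assumes the log-probabilities are already summed per sequence, and it adds a small `eps` to the GRPO denominator (an assumption, not part of the formula above) so that a group with identical rewards does not divide by zero.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are sequence log-probabilities (floats).
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    # -log sigmoid(z) == log(1 + exp(-z)); adequate for moderate |z|.
    return math.log1p(math.exp(-beta * margin))

def grpo_advantages(rewards, eps=1e-6):
    # rewards: scores for several sampled responses to the same prompt.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

baseline = dpo_loss(-11.0, -11.0, -11.0, -11.0)   # zero margin
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy prefers chosen
assert improved < baseline

adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9                        # centered within the group
```

With a zero margin the DPO loss equals log 2, which is a convenient sanity value to state out loud.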

Objective	State and core operation	Whiteboard validation
SFT / cross entropy	Shift tokens along the sequence axis: `input_ids = tokens[:, :-1]`, `labels = tokens[:, 1:]`; score one vocabulary distribution per position.	Assert shifted lengths and ignore padded label positions.
DPO	Use chosen/rejected policy log-probability margin minus the frozen reference margin inside a logistic loss.	A stronger chosen policy margin should lower loss.
PPO with KL control	Optimize sampled rewards while penalizing drift from a reference policy; rollout state matters.	Name reward, reference logprobs, and KL monitoring.
GRPO	Normalize verifier or reward values within a response group for the same prompt.	Group-relative advantages should be centered within each group.
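
The SFT row's validation can be sketched in NumPy. This version runs on full-length logits and trims them, matching the `CE(logits[:, :-1], tokens[:, 1:])` contract. Masking by `pad_id` is an assumption that the pad token never appears as a real target; many training setups use a separate ignore index or a loss mask instead.

```python
import numpy as np

def sft_loss(logits, tokens, pad_id):
    # logits: [B, S, V]; tokens: [B, S]
    pred = logits[:, :-1]      # position t predicts token t + 1
    labels = tokens[:, 1:]
    assert pred.shape[:2] == labels.shape
    # Log-softmax over the vocabulary axis.
    z = pred - pred.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    keep = labels != pad_id    # padded label positions contribute nothing
    token_logp = np.take_along_axis(logp, labels[..., None], axis=-1)[..., 0]
    return -(token_logp * keep).sum() / keep.sum()

V = 7
tokens = np.array([[1, 2, 3, 0, 0],
                   [4, 5, 6, 2, 0]])
logits = np.zeros((2, 5, V))   # uniform predictions
loss = sft_loss(logits, tokens, pad_id=0)
assert np.isclose(loss, np.log(V))
```

Uniform logits give a loss of log V, which is a quick check that the averaging covers only the unpadded positions.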

Module 07

Greedy, temperature, top-k, top-p, and beam search

Inference-time distribution transform

p_T(i) = exp(z_i / T) / sum_j exp(z_j / T)

top-k: keep the k largest logits; top-p: keep the sorted prefix whose cumulative mass first reaches p

beam_score(y_1:t) = sum_u log p(y_u | y_<u, x)

Temperature changes sharpness; top-k fixes candidate count; top-p adapts candidate count to distribution mass; beam search accumulates sequence scores.
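
The three transforms can be sketched as functions on a 1-D logit vector. The function names are hypothetical, and tie-breaking among equal logits is left to `argsort`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def apply_temperature(logits, T):
    return logits / T

def top_k_filter(logits, k):
    out = np.full_like(logits, -np.inf)
    idx = np.argsort(logits)[-k:]          # indices of the k largest logits
    out[idx] = logits[idx]
    return out

def top_p_filter(logits, p):
    probs = softmax(logits)
    order = np.argsort(-probs)             # descending probability
    cum = np.cumsum(probs[order])
    # Smallest sorted prefix whose cumulative mass reaches p.
    cutoff = int(np.searchsorted(cum, p)) + 1
    out = np.full_like(logits, -np.inf)
    out[order[:cutoff]] = logits[order[:cutoff]]
    return out

logits = np.array([2.0, 1.0, 0.5, -1.0])
kept_k = int(np.isfinite(top_k_filter(logits, 3)).sum())
kept_p = int(np.isfinite(top_p_filter(logits, 0.8)).sum())

# Filter, renormalize with softmax, then sample.
rng = np.random.default_rng(0)
probs = softmax(top_p_filter(apply_temperature(logits, 0.7), p=0.9))
token = int(rng.choice(len(probs), p=probs))
assert np.isclose(probs.sum(), 1.0)
```

For these logits, top-k with k=3 keeps three candidates while top-p with p=0.8 keeps two, which illustrates that fixed count and probability mass are different policies.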

Policy	Implementation move	Failure or tradeoff to mention
Greedy	`argmax(logits)`	Deterministic and cheap, but can be repetitive or locally shortsighted.
Temperature	Divide logits by `T` before softmax.	Low `T` sharpens; high `T` increases randomness.
Top-k / top-p	Mask unwanted logits to negative infinity, then sample.	Fixed count and probability mass are different policies.
Beam search	Retain highest-scoring partial sequences at each step.	Adds branching/cache work and is not universally best for open-ended text.
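
A toy example can show why beam search and greedy decoding may disagree. `toy_model` is a hypothetical stand-in for a model call, and the sketch omits end-of-sequence handling and length normalization.

```python
import math

def beam_search(next_logprobs, beam_width, steps):
    # next_logprobs(prefix) -> {token: log p(token | prefix)}
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(prefix).items():
                # Sequence score is the sum of per-token log-probabilities.
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    # Hypothetical distribution: the most likely first token has a weak continuation.
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    return {"x": math.log(0.9), "y": math.log(0.1)}

greedy = beam_search(toy_model, beam_width=1, steps=2)[0]
best = beam_search(toy_model, beam_width=2, steps=2)[0]
assert best[1] > greedy[1]
```

A width of 1 reduces to greedy decoding and commits to "a" (sequence probability 0.3), while a width of 2 keeps "b" alive and finds ("b", "x") at 0.36. Each extra beam also carries its own prefix, which is where the added cache work comes from.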

Separate policy from acceleration

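These rules choose output tokens from model logits. Speculative decoding is an acceleration technique, not a sampling policy: a cheaper draft proposes tokens and the target model verifies them, under an acceptance rule meant to preserve the target policy's output distribution.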

Module 08

LoRA adaptation and MoE routing

Extension	Minimal code contract	Do not confuse it with
LoRA	Freeze `W0`; add `(alpha / r) * B(A(x))`; test that merged `W0 + (alpha / r) * B @ A` matches.	A smaller base model or conditional expert routing.
MoE	Compute router probabilities, choose top-k experts per token, run selected FFNs, and combine by gate weights.	Parameter-efficient fine-tuning; MoE changes per-token execution.

LoRA parameter ledger

W' = W0 + (alpha / r) * B @ A

y = x @ W0.T + (alpha / r) * x @ A.T @ B.T

rank(delta_W) <= r and trainable = r * (d_in + d_out) for one bias-free adapted projection.
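
The merged-weight check and the parameter count can be verified numerically. The dimensions below are arbitrary, and `B` is drawn randomly only so that the comparison is non-trivial; a common initialization sets `B` to zero so the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8

W0 = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))        # trainable, [r, d_in]
B = rng.normal(size=(d_out, r))       # trainable, [d_out, r]
x = rng.normal(size=(3, d_in))
scale = alpha / r

# Adapter path: base output plus the low-rank update applied to x.
y_adapter = x @ W0.T + scale * (x @ A.T @ B.T)

# Merged path: fold the update into a single weight matrix.
W_merged = W0 + scale * (B @ A)
y_merged = x @ W_merged.T

trainable = A.size + B.size
assert np.allclose(y_adapter, y_merged)
assert trainable == r * (d_in + d_out)
```

Here the adapter trains 112 parameters against 192 in the frozen projection; the gap grows quickly at realistic dimensions because the adapter scales with `d_in + d_out` rather than `d_in * d_out`.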

MoE routing ledger

y_token = sum_{e in top_k(router(x))} gate_e(x) * Expert_e(x)

Report active experts per token, capacity or dropped-token behavior, and whether expert traffic crosses devices.
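
The routing ledger can be sketched with a dense loop over experts. Several choices here are assumptions: gate weights are renormalized over the selected experts, there is no capacity limit or token dropping, and each expert is a hypothetical linear map rather than a full FFN.

```python
import numpy as np

def moe_forward(x, W_router, experts, k):
    # x: [N, D] tokens; W_router: [D, E]; experts: callables [n, D] -> [n, D]
    logits = x @ W_router
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    top = np.argsort(-probs, axis=-1)[:, :k]             # [N, k] expert ids
    gates = np.take_along_axis(probs, top, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)    # renormalize over top-k
    y = np.zeros_like(x)
    load = np.zeros(len(experts), dtype=int)
    for e, expert in enumerate(experts):
        token_idx, slot = np.nonzero(top == e)           # tokens routed to expert e
        if token_idx.size == 0:
            continue
        y[token_idx] += gates[token_idx, slot][:, None] * expert(x[token_idx])
        load[e] = token_idx.size
    return y, load

rng = np.random.default_rng(0)
N, D, E, k = 10, 8, 4, 2
x = rng.normal(size=(N, D))
W_router = rng.normal(size=(D, E))
expert_weights = [rng.normal(size=(D, D)) * 0.1 for _ in range(E)]
experts = [lambda t, W=W: t @ W for W in expert_weights]
y, load = moe_forward(x, W_router, experts, k)
assert y.shape == x.shape and load.sum() == N * k
```

The `load` vector is the quantity to inspect for imbalance; in an expert-parallel setup, each nonzero entry on a remote device implies token traffic across devices.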

For MoE, conclude with load balance and expert-parallel communication; sparse active compute is not free capacity. For LoRA, state target modules and whether adapters are merged or dynamically served.

Spoken Checklist

How to say the answer out loud

Name the state. Say whether you are transforming hidden states, cache tensors, logits, preference logprobs, adapter weights, or routed tokens.
State the contract. Give the tensor shape, probability rule, loss inputs, or trainable-parameter formula before coding.
Write the decisive operation. Apply the causal mask before softmax, shift SFT labels, filter logits before sampling, merge a LoRA delta, or top-k route experts.
Name one failure. Examples are future leakage, cache-shape confusion, reward/reference mix-ups, over-broad sampling, incorrect LoRA merge, or expert imbalance.
Connect to systems cost. Mention sequence length, KV heads, dtype bytes, beam width, rollout/reference passes, adapter state, or all-to-all traffic.
Finish with validation. Use no-leak, cache equivalence, centered advantages, normalized filtered probabilities, merged-weight equivalence, or expert-load checks.

Source / Inspiration

Topic map, not vendored content

This page is clean-room InfraLens material inspired by the public topic structure of NashKnight/LLM-Whiteboard. External Markdown and PDF content are not copied into this repository.