LLM Live Coding

LLM Whiteboard Practice

Use this sprint when an interview prompt starts with a blank editor and an LLM mechanism. The target answer is not a full training stack. It is an explicit tensor or probability contract, compact implementation logic, and a validation check you can say out loud.

Primary path
attention -> decoder + KV cache -> losses
       -> decoding policy -> LoRA / MoE
       -> implement, test, explain
Module 01

QKV source

Scaled dot-product attention

Attention(Q, K, V) = softmax((Q K^T + M) / sqrt(Dh)) V

M is an additive mask applied before softmax: 0 where a key is visible and -inf where it is blocked (future keys in the causal case). Dh = D / H for ordinary MHA.

| Prompt shape | What to write first | Whiteboard check |
| --- | --- | --- |
| Self-attention | Q, K, and V all project from the same hidden state X[B,S,D]. | You can name the input tensor for every projection. |
| Cross-attention | Q comes from the decoder state; K/V come from encoder or modality memory. | You do not accidentally reuse decoder K/V. |
| Multimodal bridge | Text queries may read visual K/V after projection into a compatible hidden dimension. | You state the bridge dimension before attention. |

Self-attention shape ledger

| Step | Tensor | Shape to say out loud |
| --- | --- | --- |
| Input | X | [B, S, D] |
| Project and split heads | Q, K, V | [B, H, S, Dh] |
| Scores and mask | Q @ K.transpose(-2, -1) | [B, H, S, S] |
| Weighted values | softmax(scores) @ V | [B, H, S, Dh] |
| Merge and output | context.transpose(...).reshape(...) | [B, S, D] |

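The shape ledger above can be sketched as a short NumPy function that asserts each shape as it goes. This is a minimal sketch, not a reference implementation: the function name, the bias-free square projection matrices `Wq`, `Wk`, `Wv`, `Wo`, and the random test sizes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max so exp() stays numerically stable.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv, Wo, H):
    # Hypothetical helper: X is [B, S, D]; every W is [D, D] with no bias.
    B, S, D = X.shape
    Dh = D // H

    def split(t):
        # [B, S, D] -> [B, H, S, Dh]
        return t.reshape(B, S, H, Dh).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    assert Q.shape == K.shape == V.shape == (B, H, S, Dh)

    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    assert scores.shape == (B, H, S, S)

    # Keep key j for query i only when j <= i; block the rest before softmax.
    keep = np.tril(np.ones((S, S), dtype=bool))
    scores = np.where(keep, scores, -np.inf)

    context = softmax(scores) @ V
    assert context.shape == (B, H, S, Dh)

    merged = context.transpose(0, 2, 1, 3).reshape(B, S, D)
    return merged @ Wo

rng = np.random.default_rng(0)
B, S, D, H = 2, 5, 16, 4
X = rng.normal(size=(B, S, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
out = causal_self_attention(X, Wq, Wk, Wv, Wo, H)
assert out.shape == (B, S, D)
```

Saying each assert out loud while writing it is the whiteboard version of the ledger.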
Module 02

MHA, MQA, GQA, and MLA

| Variant | KV organization | Interview implication |
| --- | --- | --- |
| MHA | KV heads match query heads. | Largest standard KV cache; easiest shape story. |
| MQA | One KV head is shared across query heads. | Small cache, but implementation must repeat or broadcast K/V for attention. |
| GQA | Groups of query heads share each KV head. | Middle ground used to reduce decode bandwidth and memory. |
| MLA | Cache stores a compressed latent representation plus architecture-specific reconstruction. | Do not describe it as only fewer KV heads; the cached representation changes. |

Cache term affected by the variant

KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element

MQA and GQA reduce H_kv; MLA changes what representation is cached, so do not present it as a head-count edit alone.
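A quick way to make the H_kv effect concrete is to evaluate the formula for a made-up configuration. The sketch below assumes a hypothetical 32-layer model with 32 query heads, head dimension 128, and a 2-byte dtype; none of these numbers describe a specific model. MLA is left out on purpose because this formula does not describe its cache.

```python
def kv_cache_bytes(batch, tokens, layers, kv_heads, head_dim, bytes_per_element):
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_element

# Hypothetical serving shape: 8 active requests, 4096 retained tokens each.
shape = dict(batch=8, tokens=4096, layers=32, head_dim=128, bytes_per_element=2)

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(kv_heads=kv_heads, **shape) / 2**30
    print(f"{name}: H_kv={kv_heads:2d} -> {gib:.2f} GiB")
# MHA: H_kv=32 -> 16.00 GiB
# GQA: H_kv= 8 -> 4.00 GiB
# MQA: H_kv= 1 -> 0.50 GiB
```

Only `kv_heads` changes between the three rows, which is the point to make when comparing the variants.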

Module 03

RoPE, RMSNorm, SwiGLU, and the causal mask


RoPE

Rotate Q/K feature pairs before the dot product. In a whiteboard answer, state that RoPE changes positional geometry, not the KV-cache byte formula by itself.
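One possible sketch of the rotation, assuming the "rotate-half" pairing (feature i is paired with feature i + Dh/2) and the common base of 10000. Other implementations pair adjacent features instead, so treat the pairing and the function name as assumptions. The check at the end shows the property worth saying out loud: the Q/K dot product depends only on the position offset.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: [..., S, Dh] with even Dh; positions: [S] integer token positions.
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)        # [half]
    angles = positions[:, None] * inv_freq[None, :]     # [S, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]               # pair feature i with i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))   # one token, Dh = 8
k = rng.normal(size=(1, 8))

def score(m, n):
    # Dot product of q rotated to position m and k rotated to position n.
    return rope(q, np.array([m]))[0] @ rope(k, np.array([n]))[0]

# Same offset (2) at different absolute positions gives the same score.
assert np.isclose(score(3, 1), score(7, 5))
```

The output has the same shape as the input, which is why the cache byte formula is unchanged.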

RMSNorm

Normalize by root-mean-square magnitude over the hidden dimension, then apply a learned scale. For this sprint, it is enough to show that the output keeps the shape [B,S,D].
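A minimal NumPy sketch of that description follows. The `eps` value and the function name are assumptions; the learned scale is passed in as a plain array.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # x: [B, S, D]; weight: [D] learned scale. eps is an assumed stabilizer.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))
out = rms_norm(x, np.ones(16))
assert out.shape == x.shape   # [B, S, D] is preserved
```

Unlike LayerNorm, there is no mean subtraction and no bias term in this sketch.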

SwiGLU

Project into gate and value branches, multiply silu(gate) by the value branch, then project back to the model dimension.
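The three steps can be sketched as below. The weight names `W_gate`, `W_up`, `W_down` and the hidden width `F` are labels chosen for this example, and the layer is assumed to be bias-free.

```python
import numpy as np

def silu(x):
    # silu(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # x: [B, S, D]; W_gate and W_up: [D, F]; W_down: [F, D].
    gate = silu(x @ W_gate)           # gate branch, [B, S, F]
    value = x @ W_up                  # value branch, [B, S, F]
    return (gate * value) @ W_down    # back to the model dimension, [B, S, D]

rng = np.random.default_rng(0)
B, S, D, F = 2, 5, 16, 32
x = rng.normal(size=(B, S, D))
W_gate = rng.normal(size=(D, F)) / np.sqrt(D)
W_up = rng.normal(size=(D, F)) / np.sqrt(D)
W_down = rng.normal(size=(F, D)) / np.sqrt(F)
y = swiglu_ffn(x, W_gate, W_up, W_down)
assert y.shape == (B, S, D)
```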

Causal mask

Apply the keep/block rule before softmax. The reference example includes a no-leak smoke test: changing future tokens must not change prefix outputs.
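A no-leak check of this kind might look like the following single-head sketch. It is not the reference example mentioned above; the function name, sizes, and the choice to feed the same tensor as Q, K, and V are assumptions made to keep the test short.

```python
import numpy as np

def causal_attention(q, k, v):
    # Single head: q, k, v are [S, Dh].
    S, Dh = q.shape
    scores = q @ k.T / np.sqrt(Dh)
    keep = np.tril(np.ones((S, S), dtype=bool))    # key j visible to query i iff j <= i
    scores = np.where(keep, scores, -np.inf)       # block before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
S, Dh, prefix = 6, 8, 3
x = rng.normal(size=(S, Dh))
x_changed = x.copy()
x_changed[prefix:] = rng.normal(size=(S - prefix, Dh))   # overwrite future tokens only

out = causal_attention(x, x, x)
out_changed = causal_attention(x_changed, x_changed, x_changed)

# Prefix outputs must be identical; later outputs are free to change.
assert np.allclose(out[:prefix], out_changed[:prefix])
assert not np.allclose(out[prefix:], out_changed[prefix:])
```

If the mask were applied after softmax, or the comparison direction were flipped, the first assert would fail.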

Module 04

Decoder block flow

x: [B,S,D]
x = x + CausalSelfAttention(RMSNorm(x))
x = x + SwiGLU(RMSNorm(x))
assert x.shape == (B, S, D)

In the explanation, keep the block pre-norm: each sublayer reads RMSNorm(x) and adds its output back to the residual stream. The sprint implementation intentionally omits training-only extras such as dropout so the shape and masking invariants stay visible.

Module 05

KV-cache decode

Memory ledger

KV bytes = 2 * B * T * layers * H_kv * Dh * bytes_per_element

Before making a capacity claim, write down what each value represents: B is active requests, T is retained prompt-plus-output tokens per request, H_kv is KV heads, and bytes_per_element follows from the dtype.

| Phase | State change | System caveat |
| --- | --- | --- |
| Prefill | Process the prompt and create K/V for all prompt tokens. | Attention work is full-sequence and often compute-heavy. |
| Decode | Append one token's K/V and read retained history for the new query. | Decode often becomes cache-read and memory-bandwidth-heavy. |
| Validation | Compare full causal attention with step-by-step cached decode. | Cache equivalence must hold before optimizing layout. |

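The validation row can be sketched as a small equivalence test. This is a single-head, single-sequence sketch with a Python list standing in for the cache; the function names and sizes are assumptions, and a real cache would be a preallocated tensor.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    # Prefill-style pass: all S tokens at once with a causal mask.
    S, Dh = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Dh)
    scores = np.where(np.tril(np.ones((S, S), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

def cached_decode(X, Wq, Wk, Wv):
    # Decode-style pass: one token per step, appending its K/V to the cache.
    Dh = Wq.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for x_t in X:
        k_cache.append(x_t @ Wk)
        v_cache.append(x_t @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)    # [t + 1, Dh]
        scores = (x_t @ Wq) @ K.T / np.sqrt(Dh)         # [t + 1]
        # No mask needed: the cache only holds past and current tokens.
        outputs.append(softmax(scores) @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
S, D, Dh = 5, 8, 4
X = rng.normal(size=(S, D))
Wq, Wk, Wv = (rng.normal(size=(D, Dh)) for _ in range(3))
assert np.allclose(full_causal_attention(X, Wq, Wk, Wv), cached_decode(X, Wq, Wk, Wv))
```

The decode loop never recomputes K/V for old tokens, which is where the memory-bandwidth caveat in the table comes from.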
Module 06

SFT, DPO, PPO / KL, and GRPO


Do not treat all post-training losses as interchangeable code. Start by naming whether the input is shifted language-model labels, preference pairs, or sampled reward-bearing responses.

Loss contracts to write before code

SFT: CE(logits[:, :-1], tokens[:, 1:])

DPO: -log sigmoid(beta * ((log pi_chosen - log pi_rejected) - (log ref_chosen - log ref_rejected)))

GRPO: advantage_i = (reward_i - mean(group_rewards)) / (std(group_rewards) + eps)
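The DPO and GRPO contracts above can be sketched in plain Python. Assumptions to state: the DPO inputs are summed sequence log-probabilities for one preference pair, `beta=0.1` is an arbitrary example value, and the GRPO helper uses the population standard deviation with a small `eps` so a group with identical rewards does not divide by zero.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Policy margin minus frozen-reference margin, inside a logistic loss.
    z = beta * ((pi_chosen - pi_rejected) - (ref_chosen - ref_rejected))
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in an overflow-safe form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def grpo_advantages(group_rewards, eps=1e-6):
    # Rewards for several sampled responses to the same prompt.
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Zero margin gives log(2); a stronger chosen-policy margin lowers the loss.
assert math.isclose(dpo_loss(0.0, 0.0, 0.0, 0.0), math.log(2))
assert dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Advantages are centered within the group.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9
```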

| Objective | State and core operation | Whiteboard validation |
| --- | --- | --- |
| SFT / cross entropy | Shift tokens along the sequence axis: input_ids = tokens[:, :-1], labels = tokens[:, 1:]; score one vocabulary distribution per position. | Assert shifted lengths and ignore padded label positions. |
| DPO | Use chosen/rejected policy log-probability margin minus the frozen reference margin inside a logistic loss. | A stronger chosen policy margin should lower loss. |
| PPO with KL control | Optimize sampled rewards while penalizing drift from a reference policy; rollout state matters. | Name reward, reference logprobs, and KL monitoring. |
| GRPO | Normalize verifier or reward values within a response group for the same prompt. | Group-relative advantages should be centered within each group. |

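For the SFT row, a shifted cross-entropy with padded labels ignored might be sketched as follows. It assumes logits are computed for the full sequence (matching the `logits[:, :-1]` contract above), that `pad_id` never appears as a real target token, and that the loss is averaged over kept positions; the function name is made up.

```python
import numpy as np

def sft_loss(logits, tokens, pad_id):
    # logits: [B, S, V]; tokens: [B, S] integer ids.
    pred = logits[:, :-1]                 # position t predicts token t + 1
    labels = tokens[:, 1:]
    assert pred.shape[:2] == labels.shape

    # Log-softmax over the vocabulary axis.
    pred = pred - pred.max(axis=-1, keepdims=True)
    logp = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))

    keep = labels != pad_id               # padded label positions carry no loss
    token_logp = np.take_along_axis(logp, labels[..., None], axis=-1)[..., 0]
    return -(token_logp * keep).sum() / keep.sum()

tokens = np.array([[1, 2, 3, 0],
                   [4, 2, 0, 0]])         # 0 is the assumed pad id
uniform = np.zeros((2, 4, 5))             # uniform distribution over V = 5
assert np.isclose(sft_loss(uniform, tokens, pad_id=0), np.log(5))
```

Uniform logits give a loss of log(V), which is a quick sanity value to quote.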
Module 07

Greedy, temperature, top-k, top-p, and beam search

Inference-time distribution transform

p_T(i) = exp(z_i / T) / sum_j exp(z_j / T)

top-k: keep the k largest logits; top-p: keep the sorted prefix whose cumulative mass first reaches p

beam_score(y_1:t) = sum_u log p(y_u | y_<u, x)

Temperature changes sharpness; top-k fixes candidate count; top-p adapts candidate count to distribution mass; beam search accumulates sequence scores.
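These transforms can be sketched as one logit filter followed by a softmax. Assumptions: the helper name `filter_logits`, the order temperature, then top-k, then top-p, and the tie rule (ties at the k-th logit are all kept) are choices made for this example and differ between libraries.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    z = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        kth = np.sort(z)[-top_k]                  # k-th largest logit
        z = np.where(z >= kth, z, -np.inf)        # ties at the threshold are kept
    if top_p is not None:
        order = np.argsort(-z)                    # token ids, most likely first
        probs = softmax(z)[order]
        mass_before = np.cumsum(probs) - probs
        keep_sorted = mass_before < top_p         # prefix whose mass first reaches p
        mask = np.zeros(z.shape, dtype=bool)
        mask[order[keep_sorted]] = True
        z = np.where(mask, z, -np.inf)
    return z

rng = np.random.default_rng(0)
logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))

probs = softmax(filter_logits(logits, temperature=0.8, top_k=3, top_p=0.9))
assert np.isclose(probs.sum(), 1.0)               # filtered probabilities renormalize
token = int(rng.choice(len(probs), p=probs))      # sample from the filtered policy
```

With `top_p=0.7` on these probabilities, the kept set is the first two tokens, because their cumulative mass (0.8) is the first to reach 0.7.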

| Policy | Implementation move | Failure or tradeoff to mention |
| --- | --- | --- |
| Greedy | argmax(logits) | Deterministic and cheap, but can be repetitive or locally shortsighted. |
| Temperature | Divide logits by T before softmax. | Low T sharpens; high T increases randomness. |
| Top-k / top-p | Mask unwanted logits to negative infinity, then sample. | Fixed count and probability mass are different policies. |
| Beam search | Retain highest-scoring partial sequences at each step. | Adds branching/cache work and is not universally best for open-ended text. |

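A toy beam search shows why greedy can be locally shortsighted. Everything here is hypothetical: `toy_model` is a hand-written lookup table standing in for a model, and the sketch omits end-of-sequence handling and length normalization.

```python
import math

def beam_search(next_log_probs, beam_width, steps):
    # next_log_probs(prefix) -> {token: log p(token | prefix)}; assumed interface.
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), score + lp)       # accumulate sequence log-probability
            for prefix, score in beams
            for tok, lp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]         # retain the highest-scoring partials
    return beams

def toy_model(prefix):
    # Hand-written conditional probabilities for a two-step vocabulary.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}

greedy_seq, greedy_score = beam_search(toy_model, beam_width=1, steps=2)[0]
beam_seq, beam_score = beam_search(toy_model, beam_width=2, steps=2)[0]

assert greedy_seq == ("a", "x")     # probability 0.6 * 0.55 = 0.33
assert beam_seq == ("b", "x")       # probability 0.4 * 0.9 = 0.36
assert beam_score > greedy_score
```

Beam width 1 is greedy decoding; width 2 keeps the "b" prefix alive long enough to find the higher-probability sequence, at the cost of scoring twice as many prefixes.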
Separate policy from acceleration

These rules choose output tokens from model logits. Speculative decoding is a different kind of change: it proposes candidate tokens cheaply and verifies them against the target model, and it is only correct when its acceptance rule preserves the target decoding policy.

Module 08

LoRA adaptation and MoE routing

| Extension | Minimal code contract | Do not confuse it with |
| --- | --- | --- |
| LoRA | Freeze W0; add (alpha / r) * B(A(x)); test that merged W0 + (alpha / r) * B @ A matches. | A smaller base model or conditional expert routing. |
| MoE | Compute router probabilities, choose top-k experts per token, run selected FFNs, and combine by gate weights. | Parameter-efficient fine-tuning; MoE changes per-token execution. |

LoRA parameter ledger

W' = W0 + (alpha / r) * B @ A

y = x @ W0.T + (alpha / r) * x @ A.T @ B.T

With A of shape [r, d_in] and B of shape [d_out, r], rank(delta_W) <= r and trainable = r * (d_in + d_out) for one bias-free adapted projection.
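The ledger above can be checked numerically. In this sketch the sizes and `alpha` are arbitrary, and B is drawn at random so the comparison is non-trivial; in practice B is commonly initialized to zero so the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8.0

W0 = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = rng.normal(size=(d_out, r))           # trainable up-projection (random here by assumption)
x = rng.normal(size=(3, d_in))

scale = alpha / r

# Adapter path: base output plus the scaled low-rank branch.
y_adapter = x @ W0.T + scale * (x @ A.T) @ B.T

# Merged path: fold the delta into one weight matrix.
W_merged = W0 + scale * B @ A
y_merged = x @ W_merged.T

assert np.allclose(y_adapter, y_merged)             # merged-weight equivalence
assert np.linalg.matrix_rank(B @ A) <= r            # rank(delta_W) <= r
assert A.size + B.size == r * (d_in + d_out)        # 4 * (16 + 12) = 112 trainable values
```

Dropping the `alpha / r` scale from only one of the two paths is the usual incorrect-merge bug this test catches.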

MoE routing ledger

y_token = sum_{e in top_k(router(x))} gate_e(x) * Expert_e(x)

Report active experts per token, capacity or dropped-token behavior, and whether expert traffic crosses devices.
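A dense-loop sketch of the routing formula is shown below. Assumptions: gate weights are renormalized over the selected experts (some designs use the raw router probabilities), there is no capacity limit or token dropping, and the experts are small stand-in functions that all run on one device.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, W_router, experts, k):
    # x: [N, D] tokens; W_router: [D, E]; experts: list of E callables [n, D] -> [n, D].
    probs = softmax(x @ W_router)                           # router probabilities, [N, E]
    top = np.argsort(-probs, axis=-1)[:, :k]                # chosen expert ids, [N, k]
    gates = np.take_along_axis(probs, top, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)       # renormalize (assumption)

    y = np.zeros_like(x)
    load = np.zeros(len(experts), dtype=int)                # tokens routed to each expert
    for e, expert in enumerate(experts):
        token_idx, slot = np.nonzero(top == e)
        if token_idx.size == 0:
            continue
        load[e] = token_idx.size
        y[token_idx] += gates[token_idx, slot][:, None] * expert(x[token_idx])
    return y, load

rng = np.random.default_rng(0)
N, D, E, k = 6, 8, 4, 2
x = rng.normal(size=(N, D))
W_router = rng.normal(size=(D, E))
expert_weights = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(E)]
experts = [lambda h, W=W: np.tanh(h @ W) for W in expert_weights]

y, load = moe_forward(x, W_router, experts, k)
assert y.shape == x.shape
assert load.sum() == N * k      # every token activates exactly k experts
```

The `load` vector is the expert-load check from the spoken checklist: a heavily skewed `load` is the imbalance failure to name.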

For MoE, conclude with load balance and expert-parallel communication; sparse active compute is not free capacity. For LoRA, state target modules and whether adapters are merged or dynamically served.

Spoken Checklist

How to say the answer out loud

  1. Name the state. Say whether you are transforming hidden states, cache tensors, logits, preference logprobs, adapter weights, or routed tokens.
  2. State the contract. Give the tensor shape, probability rule, loss inputs, or trainable-parameter formula before coding.
  3. Write the decisive operation. Apply the causal mask before softmax, shift SFT labels, filter logits before sampling, merge a LoRA delta, or top-k route experts.
  4. Name one failure. Examples are future leakage, cache-shape confusion, reward/reference mix-ups, over-broad sampling, incorrect LoRA merge, or expert imbalance.
  5. Connect to systems cost. Mention sequence length, KV heads, dtype bytes, beam width, rollout/reference passes, adapter state, or all-to-all traffic.
  6. Finish with validation. Use no-leak, cache equivalence, centered advantages, normalized filtered probabilities, merged-weight equivalence, or expert-load checks.
Source / Inspiration

Topic map, not vendored content


This page is clean-room InfraLens material inspired by the public topic structure of NashKnight/LLM-Whiteboard. External Markdown and PDF content are not copied into this repository.