RL Infrastructure Interview Practice
Build each answer around roles, versioned state, throughput, and recovery rather than naming an optimization objective alone.
Four system explanations
01 What roles exist in an online LLM RL training system?
Short Answer
Actors generate versioned trajectories, evaluators or reward logic score outcomes, a reference policy can anchor KL accounting, and learners update and publish policy weights. PPO may also include value estimation; GRPO-style objectives can form relative advantages from grouped samples.
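To make the last point concrete, here is a minimal sketch of how relative advantages can be formed from a group of samples that share one prompt. The `Trajectory` record, its field names, and the `eps` constant are assumptions made for illustration. The normalization shown (subtract the group mean, divide by the group standard deviation) is one common GRPO-style choice, not the only one.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record an actor emits; field names are illustrative."""
    prompt_id: str
    policy_version: int  # version of the weights that generated the sample
    reward: float        # scalar score from an evaluator or reward logic


def group_relative_advantages(group, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    rewards = [t.reward for t in group]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for one prompt, all generated by policy version 7.
group = [Trajectory("p0", 7, r) for r in (1.0, 0.0, 1.0, 0.0)]
advantages = group_relative_advantages(group)
```

Keeping `policy_version` on every record is what lets the learner reason about staleness later. The advantage computation itself does not use it.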
Common Mistake
Describing the objective while omitting the generation, publication, and recovery runtime.
02 Why can more rollout actors harm an on-policy loop?
Short Answer
If actors generate faster than the learner consumes, trajectories wait in the queue while the policy keeps advancing, so by the time they are trained on they carry stale policy versions. Bound the version lag or discard delayed work, and report backlog and freshness alongside throughput.
Follow-up
Use the RL Rollout Capacity Estimator to justify actor sizing assumptions.
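The staleness bound described in the short answer can be sketched as a filter over the rollout queue. This is an illustrative sketch, not a specific framework's API: `QueuedRollout`, `split_by_freshness`, and `max_version_lag` are hypothetical names, and the backlog helper assumes steady generation and consumption rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedRollout:
    """Hypothetical queue entry tagged with the generating policy version."""
    rollout_id: str
    policy_version: int


def split_by_freshness(queue, learner_version, max_version_lag):
    """Partition queued rollouts into trainable and too-stale work."""
    fresh, stale = [], []
    for item in queue:
        lag = learner_version - item.policy_version
        (fresh if lag <= max_version_lag else stale).append(item)
    return fresh, stale


def backlog_growth_per_second(generation_rate, consumption_rate):
    """Rollouts per second accumulating in the queue (never negative)."""
    return max(0.0, generation_rate - consumption_rate)


queue = [QueuedRollout("a", 10), QueuedRollout("b", 8), QueuedRollout("c", 5)]
fresh, stale = split_by_freshness(queue, learner_version=10, max_version_lag=2)
growth = backlog_growth_per_second(generation_rate=120.0, consumption_rate=100.0)
```

A positive `growth` means added actors are lengthening the queue rather than speeding up training, which is the point at which the lag bound starts discarding work.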
03 How do you choose colocated versus disaggregated actors and learners?
Short Answer
Colocation reduces weight-transfer overhead but makes serving KV caches contend for device memory with learner activations and optimizer state. Disaggregation allows independent scaling but requires weight publication, rollout versioning, transfer accounting, and explicit fault ownership.
| Placement | Benefit | Cost or responsibility |
|---|---|---|
| Colocated | Avoids frequent policy-weight movement between pools. | Rollout cache and learner training state compete for devices. |
| Disaggregated | Actors and learners scale and recover independently. | Requires versioning, transfer accounting, and fault ownership. |
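The transfer accounting in the disaggregated row can start as back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: weights are published in full each time, the transfer does not overlap with training, and a single effective link bandwidth applies. The function names and example numbers are hypothetical.

```python
def weight_publish_seconds(num_params, bytes_per_param, link_bytes_per_second):
    """Time to move one full copy of the weights over the given link."""
    return num_params * bytes_per_param / link_bytes_per_second


def publish_overhead_fraction(publish_seconds, learner_step_seconds, steps_per_publish):
    """Share of each publish cycle spent transferring weights (no overlap assumed)."""
    cycle = learner_step_seconds * steps_per_publish + publish_seconds
    return publish_seconds / cycle


# Assumed example: 7e9 parameters at 2 bytes each over a 50 GB/s effective link.
publish_s = weight_publish_seconds(7_000_000_000, 2, 50e9)
overhead = publish_overhead_fraction(publish_s, learner_step_seconds=2.0, steps_per_publish=1)
```

Publishing less often lowers `overhead` but raises the version lag actors run with, so the two numbers are best quoted together in an answer.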
04 What must be checkpointed for reliable recovery?
Short Answer
Persist model and optimizer state, learning-rate scheduler and gradient scaler state where applicable, the policy version, the consumption cursor, and restart rules for in-flight trajectories. For sharded models, use a distributed checkpoint format that supports parallel save/load across ranks and resharding at load time.
Source: PyTorch Distributed Checkpoint
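The RL-specific items in that list (policy version, consumption cursor, in-flight rule) are small enough to keep in a manifest next to the tensor checkpoint. Below is a minimal sketch using only the Python standard library. The `RLCheckpointManifest` fields and the `"discard"`/`"replay"` values are assumptions for illustration. Model and optimizer tensors are assumed to be saved separately by a checkpoint library, and this sketch does not call one.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class RLCheckpointManifest:
    """Hypothetical RL metadata stored beside the tensor checkpoint."""
    policy_version: int         # version of the weights in this checkpoint
    consumption_cursor: int     # last trajectory offset the learner consumed
    tensor_checkpoint_dir: str  # where model/optimizer state was written
    inflight_policy: str        # restart rule: "discard" or "replay"


def save_manifest(manifest, path):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(manifest), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the destination path


def load_manifest(path):
    with open(path) as f:
        return RLCheckpointManifest(**json.load(f))


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "manifest.json")
    save_manifest(RLCheckpointManifest(42, 9000, "step_42", "discard"), target)
    restored = load_manifest(target)
```

Writing the manifest only after the tensor checkpoint has finished means a manifest on disk always points at complete state, which keeps the restart logic simple.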
