Training Runtime

RL Infrastructure Interview Practice

Answer in terms of roles, versioned state, throughput, and recovery rather than naming an optimization objective alone.

Question Map

Four system explanations

01 What roles exist in an online LLM RL training system?

Short Answer

Actors generate versioned trajectories, evaluators or reward logic score outcomes, a reference policy can anchor KL accounting, and learners update and publish policy weights. PPO may also include value estimation; GRPO-style objectives can form relative advantages from grouped samples.
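
To make "versioned trajectories" and "relative advantages from grouped samples" concrete, here is a minimal sketch in plain Python. The `Trajectory` record and the `group_relative_advantages` helper are hypothetical names, and the mean/standard-deviation normalization is one common way to form group-relative advantages, assumed here rather than taken from any specific framework.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Trajectory:
    prompt_id: str       # samples that share a prompt form one group
    policy_version: int  # version of the weights the actor sampled from
    reward: float        # scalar score from an evaluator or reward logic


def group_relative_advantages(group, eps=1e-6):
    """Center and scale each reward against its own group's statistics."""
    rewards = [t.reward for t in group]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards all-equal groups
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for the same prompt, all generated by policy version 7.
group = [
    Trajectory("p0", policy_version=7, reward=1.0),
    Trajectory("p0", policy_version=7, reward=0.0),
    Trajectory("p0", policy_version=7, reward=0.0),
    Trajectory("p0", policy_version=7, reward=1.0),
]
advantages = group_relative_advantages(group)
```

Samples scored above the group mean get positive advantages and the rest get negative ones, so no separate value estimate is needed in this sketch. The `policy_version` field is carried so the learner can check freshness before using the sample.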

Common Mistake

Describing the objective while omitting the generation, publication, and recovery runtime.

02 Why can more rollout actors harm an on-policy loop?

Short Answer

If actors generate faster than the learner consumes, trajectories queue while the policy keeps updating, so they reach the learner tagged with stale policy versions. Bound the version lag or discard delayed work, and measure backlog and freshness alongside throughput.
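
The version-lag bound described above can be sketched as a filter on the trajectory queue. The names `QueuedTrajectory`, `drain_fresh`, and `max_version_lag` are hypothetical, and discarding stale items outright is only one possible policy; a real system might instead apply an off-policy correction.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTrajectory:
    trajectory_id: int
    policy_version: int  # version the actor sampled from


def drain_fresh(queue, learner_version, max_version_lag):
    """Pop everything queued; keep items within the lag bound, drop the rest."""
    fresh, dropped = [], []
    while queue:
        item = queue.popleft()
        lag = learner_version - item.policy_version
        (fresh if lag <= max_version_lag else dropped).append(item)
    return fresh, dropped


# Five queued trajectories sampled from policy versions 3, 4, 6, 7, 7.
queue = deque(QueuedTrajectory(i, v) for i, v in enumerate([3, 4, 6, 7, 7]))
backlog = len(queue)  # queue depth before draining
fresh, dropped = drain_fresh(queue, learner_version=7, max_version_lag=1)
stale_fraction = len(dropped) / backlog  # freshness metric to track over time
```

Tracking `backlog` and `stale_fraction` next to raw tokens per second shows when extra actors are producing work the learner cannot use.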

Follow-up

Use the RL Rollout Capacity Estimator to justify actor sizing assumptions.
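
The kind of assumption a sizing estimate rests on can be written as back-of-envelope arithmetic. The sketch below is not the RL Rollout Capacity Estimator; the function names and the throughput figures are hypothetical and purely illustrative.

```python
import math


def actors_to_match_learner(learner_tokens_per_s, actor_tokens_per_s):
    """Smallest actor count whose combined generation rate covers the learner."""
    return math.ceil(learner_tokens_per_s / actor_tokens_per_s)


def backlog_growth_tokens_per_s(n_actors, actor_tokens_per_s, learner_tokens_per_s):
    """Positive values mean the queue grows and trajectories age before use."""
    return n_actors * actor_tokens_per_s - learner_tokens_per_s


# Illustrative numbers only: the learner consumes 40,000 tokens/s and each
# actor generates 2,500 tokens/s.
balanced = actors_to_match_learner(40_000, 2_500)
surplus = backlog_growth_tokens_per_s(20, 2_500, 40_000)
```

With these assumed rates, any actor count above `balanced` produces a positive `surplus`, which is the backlog that turns into version lag.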

03 How do you choose colocated versus disaggregated actors and learners?

Short Answer

Colocation reduces weight-transfer overhead but makes serving KV caches contend with learner activations and optimizer state. Disaggregation allows independent scaling but requires weight publication, rollout versioning, transfer accounting, and explicit fault ownership.

Placement | Benefit | Cost or responsibility
Colocated | Avoids frequent policy-weight movement between pools. | Rollout cache and learner training state compete for devices.
Disaggregated | Actors and learners scale and recover independently. | Requires versioning, transfer accounting, and fault ownership.
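
The colocation cost in the comparison above comes down to whether rollout cache and training state fit on the same device. A minimal sketch of that check follows; `DeviceBudget` and its fields are hypothetical, and the byte counts are illustrative inputs rather than measurements.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class DeviceBudget:
    device_bytes: int
    training_state_bytes: int  # weights, gradients, optimizer state, activations
    kv_cache_bytes: int        # serving cache needed by rollout generation

    def colocated_fits(self):
        """True when both workloads fit on one device at the same time."""
        return self.training_state_bytes + self.kv_cache_bytes <= self.device_bytes

    def colocated_shortfall_bytes(self):
        """Bytes that would have to be freed, offloaded, or moved elsewhere."""
        needed = self.training_state_bytes + self.kv_cache_bytes
        return max(0, needed - self.device_bytes)


# Illustrative only: an 80 GiB device, 60 GiB of training state, 30 GiB of KV cache.
budget = DeviceBudget(80 * GIB, 60 * GIB, 30 * GIB)
fits = budget.colocated_fits()
shortfall_gib = budget.colocated_shortfall_bytes() / GIB
```

A nonzero shortfall is the point where the trade-off tips: either shrink the cache or batch, time-share the device between phases, or pay the versioning and transfer costs of disaggregation.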

04 What must be checkpointed for reliable recovery?

Short Answer

Persist model and optimizer state, any scheduler or gradient-scaler state in use, the policy version, the consumption cursor, and the restart rules for in-flight trajectories. For sharded models, use a distributed checkpoint format that supports parallel save/load across ranks and resharding on load.

Source: PyTorch Distributed Checkpoint
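
The non-tensor part of the checkpoint list above can be sketched with the standard library alone. `RuntimeManifest`, its field names, and the `"discard"` rule value are hypothetical; model and optimizer tensors are assumed to be written separately by a distributed checkpoint library, and this sketch only records where they live.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeManifest:
    policy_version: int         # version of the weights in the tensor checkpoint
    consumption_cursor: int     # how far the learner has read the trajectory stream
    in_flight_rule: str         # what to do with rollouts interrupted by the crash
    tensor_checkpoint_dir: str  # location of the separately saved tensor state


def save_manifest(manifest, path):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(manifest), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the previous manifest


def load_manifest(path):
    with open(path) as f:
        return RuntimeManifest(**json.load(f))


workdir = tempfile.mkdtemp()
manifest_path = os.path.join(workdir, "manifest.json")
saved = RuntimeManifest(
    policy_version=12,
    consumption_cursor=4096,
    in_flight_rule="discard",
    tensor_checkpoint_dir="step_12",
)
save_manifest(saved, manifest_path)
restored = load_manifest(manifest_path)
```

Writing the manifest only after the tensor checkpoint has finished means the manifest never points at an incomplete save, and the policy version and cursor are restored together instead of drifting apart.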