Training Runtime

RL Infrastructure

Connect policy objectives to the production system that generates and scores versioned trajectories, updates the policy, publishes weights, and recovers from failure.

Start here

The online loop

prompts -> rollout actors (policy v7) -> reward / advantage
          |                                   |
          +---- publish weights <- learner <-+
                         |
                    checkpoint + metrics

Roles

Separate objective state from runtime roles

Actors and evaluators

Actors produce response tokens and carry a policy version. Evaluators compute reward or rule outcomes and may consult a reference policy for KL-related accounting.
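As one illustration of KL-related accounting, an evaluator can estimate the sequence-level KL from the log-probabilities that the actor policy and the reference policy assign to the sampled tokens, then subtract a scaled penalty from the reward. This is a minimal sketch: the function names, the simple summed log-ratio estimator, and the coefficient `beta` are assumptions, not something the text above prescribes.

```python
def sampled_kl(logp_policy, logp_ref):
    """Single-sample KL estimate: sum of per-token log-ratios on sampled tokens.

    Assumption: both lists hold log-probabilities of the SAME sampled tokens,
    one under the actor policy and one under the reference policy.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must cover the same tokens")
    return sum(p - r for p, r in zip(logp_policy, logp_ref))


def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Reward minus a KL penalty; beta is a hypothetical tuning coefficient."""
    return reward - beta * sampled_kl(logp_policy, logp_ref)
```

Whether the penalty is folded into the reward or added to the loss is an objective-level decision; the runtime concern is only that the reference log-probabilities are available where the evaluator runs.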

Learners

Learners consume versioned samples, compute the specified PPO- or GRPO-style objective, apply optimizer updates, and publish a new serving snapshot.
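To make "PPO- or GRPO-style objective" concrete, the sketch below shows a group-relative advantage and a clipped surrogate for a single token in plain Python. It is illustrative only: the function names, the use of the population standard deviation, the `eps` stabilizer, and the clip range of 0.2 are assumptions, and a real learner would compute this over batched tensors.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's group.

    Assumption: `rewards` holds the scalar rewards of all responses sampled
    for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one token (a quantity to maximize).

    logp_old is the log-probability recorded by the actor at rollout time;
    logp_new is the learner's current log-probability of the same token.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)
```

The ratio uses the actor's recorded log-probability, which is why each sample has to carry the policy version and log-probabilities it was generated with.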

Do not conflate roles

PPO and GRPO define optimization behavior; actor placement, reward serving, learner sharding, and weight transfer define the infrastructure cost and failure model.

Synchronization

Capacity without freshness is not an on-policy win

If actors produce tokens faster than learners consume them, the queue holds trajectories from older policy versions. Record the policy version on each trajectory, bound acceptable version lag, and expose discarded or delayed samples.
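A bounded-lag admission check can be sketched as follows. The `Trajectory` and `LagGate` names and the `max_lag` parameter are hypothetical; the point is only that lag is computed from the version stamped on each trajectory and that rejected samples are counted rather than dropped silently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    tokens: tuple
    policy_version: int  # version of the policy that generated the tokens


@dataclass
class LagGate:
    max_lag: int        # largest acceptable (learner version - sample version)
    accepted: int = 0
    discarded: int = 0  # exposed so dashboards can report stale samples

    def admit(self, traj, learner_version):
        """Return True if the trajectory is fresh enough to train on."""
        lag = learner_version - traj.policy_version
        if lag < 0:
            raise ValueError("trajectory is newer than the learner's policy")
        if lag > self.max_lag:
            self.discarded += 1
            return False
        self.accepted += 1
        return True
```

For example, with `max_lag=2` and a learner at version 8, a trajectory from version 7 is admitted and one from version 5 is counted as discarded.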

Operational ledger

Generated tokens per interval, learner tokens per update, backlog delta, version lag, KL, reward distribution, and publish latency belong in one dashboard.
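One way to derive several of these ledger fields from raw counters is sketched below. The `Interval` record, its field names, and the choice to measure both token counts over the same interval are assumptions made for illustration; KL is omitted because it depends on the objective in use.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class Interval:
    generated_tokens: int       # tokens actors produced in this interval
    learner_tokens: int         # tokens the learner consumed in this interval
    oldest_queued_version: int  # oldest policy version still in the queue
    learner_version: int        # policy version the learner is updating
    publish_seconds: float      # time to make the last snapshot servable
    rewards: tuple              # rewards observed in this interval


def ledger_row(iv, backlog_before):
    """Build one dashboard row; backlog is counted in tokens."""
    backlog_delta = iv.generated_tokens - iv.learner_tokens
    return {
        "backlog_delta": backlog_delta,
        "backlog": backlog_before + backlog_delta,
        "version_lag": iv.learner_version - iv.oldest_queued_version,
        "reward_mean": mean(iv.rewards),
        "reward_median": median(iv.rewards),
        "publish_seconds": iv.publish_seconds,
    }
```

A persistently positive backlog delta alongside a growing version lag is the signal that extra actor capacity is producing stale data rather than useful samples.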

Placement

Colocation trades transfers for contention

Colocated runtime

Generation and learning share accelerators or time slices. Weight refresh is simpler, but KV cache, activations, optimizer state, and kernels compete for memory and schedule time.

Disaggregated runtime

Actor pools and learner groups scale independently. This requires explicit weight publication, sample versioning, transfer accounting, and failure ownership.
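Explicit weight publication with transfer accounting might look like the in-memory sketch below. `SnapshotRegistry`, its methods, and the string snapshot reference are hypothetical stand-ins for whatever store and transport a real deployment uses.

```python
from dataclasses import dataclass


@dataclass
class SnapshotRegistry:
    latest_version: int = 0
    latest_ref: str = ""        # hypothetical locator for the published weights
    bytes_transferred: int = 0  # running total for transfer accounting

    def publish(self, version, ref):
        """Learner side: register a new snapshot; versions must increase."""
        if version <= self.latest_version:
            raise ValueError("published versions must be strictly increasing")
        self.latest_version = version
        self.latest_ref = ref

    def pull(self, actor_version, snapshot_bytes):
        """Actor side: return (version, ref) if a newer snapshot exists.

        Returns None when the actor is already current, so no transfer is
        charged in that case.
        """
        if actor_version >= self.latest_version:
            return None
        self.bytes_transferred += snapshot_bytes
        return self.latest_version, self.latest_ref
```

Rejecting non-increasing versions keeps the version stamped on each trajectory unambiguous even when a learner restarts and republishes.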

Recovery

Checkpoint the loop, not only the weights

A restartable run needs model state, optimizer state, scheduler/scaler state where applicable, policy version, data/rollout cursors, and the rule for accepting in-flight trajectories. Sharded training requires checkpoint APIs that understand distributed state and load-time resharding.
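The non-weight part of that state can be sketched as a small record written atomically next to the framework's own checkpoint. Everything here is an assumption for illustration: the `LoopState` fields, the use of JSON, and the `min_accepted_version` acceptance rule. Model and optimizer state are referenced by hypothetical locator strings because saving them is framework-specific.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class LoopState:
    policy_version: int
    prompt_cursor: int         # next prompt index to hand to actors
    rollout_cursor: int        # next rollout index the learner will consume
    min_accepted_version: int  # in-flight samples older than this are rejected
    weights_ref: str           # hypothetical locator for model state
    optimizer_ref: str         # hypothetical locator for optimizer state


def save_loop_state(state, path):
    """Write to a temp file, fsync, then rename so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_loop_state(path):
    with open(path) as f:
        return LoopState(**json.load(f))


def accept_in_flight(state, trajectory_version):
    """Acceptance rule applied after restart to trajectories still in flight."""
    return trajectory_version >= state.min_accepted_version
```

Storing the acceptance rule with the checkpoint means a restarted learner applies the same policy to in-flight trajectories as the run that wrote it.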

References

Primary sources
