The online loop
Connect policy objectives to the production system that generates, scores, updates, publishes, and recovers versioned trajectories.

prompts -> rollout actors (policy v7) -> reward / advantage
           |                                  |
           +---- publish weights <- learner <-+
                                       |
                              checkpoint + metrics

Actors produce response tokens and carry a policy version. Evaluators compute reward or rule outcomes and may consult a reference policy for KL-related accounting.
Learners consume versioned samples, compute the specified PPO- or GRPO-style objective, apply optimizer updates, and publish a new serving snapshot.
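As one illustration of the reward-to-advantage step, here is a minimal sketch of a GRPO-style group-relative advantage: rewards for responses sampled from the same prompt are centered on the group mean and scaled by the group standard deviation. The `Sample` record, its field names, the use of the population standard deviation, and the `eps` constant are assumptions for this example; a real run should follow whatever normalization its objective specifies.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one scored response."""
    prompt_id: str
    policy_version: int  # version of the policy that generated the tokens
    reward: float


def group_relative_advantages(group: list[Sample], eps: float = 1e-6) -> list[float]:
    """Center rewards on the group mean and scale by the group std.

    All samples in `group` are assumed to come from the same prompt.
    """
    rewards = [s.reward for s in group]
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, all generated by policy version 7.
group = [
    Sample("p0", 7, 1.0),
    Sample("p0", 7, 0.0),
    Sample("p0", 7, 0.0),
    Sample("p0", 7, 1.0),
]
advantages = group_relative_advantages(group)
```

When every reward in a group is identical, the numerator is zero for each sample, so the group contributes no advantage signal; `eps` only guards the division.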
PPO and GRPO define optimization behavior; actor placement, reward serving, learner sharding, and weight transfer define the infrastructure cost and failure model.
If actors produce tokens faster than learners consume them, the queue holds trajectories from older policy versions. Record the policy version on each trajectory, bound acceptable version lag, and expose discarded or delayed samples.
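The version-lag bound can be sketched as a filter that runs before each learner update. This is an assumption-laden example, not a prescribed policy: the `Trajectory` record, the `max_lag` parameter, and the choice to discard (rather than down-weight or delay) stale samples are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical trajectory record; only the version matters here."""
    trajectory_id: str
    policy_version: int


def filter_by_version_lag(batch, learner_version, max_lag):
    """Split a batch into accepted and discarded trajectories.

    Lag is learner_version - policy_version. Trajectories whose lag exceeds
    max_lag (or is negative, which indicates a bookkeeping error) are
    discarded. The lag histogram is returned so drops can be reported.
    """
    accepted, discarded = [], []
    lag_histogram = Counter()
    for traj in batch:
        lag = learner_version - traj.policy_version
        lag_histogram[lag] += 1
        if 0 <= lag <= max_lag:
            accepted.append(traj)
        else:
            discarded.append(traj)
    return accepted, discarded, lag_histogram


batch = [Trajectory("t0", 7), Trajectory("t1", 6), Trajectory("t2", 4)]
accepted, discarded, lag_histogram = filter_by_version_lag(
    batch, learner_version=7, max_lag=2
)
```

Returning the discarded list and the histogram, instead of silently dropping samples, is what makes the drop rate visible to monitoring.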
Generated tokens per interval, learner tokens per update, backlog delta, version lag, KL, reward distribution, and publish latency belong on one dashboard.
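Two of these signals can be derived from simple counters. The sketch below assumes a hypothetical `IntervalCounters` record collected once per reporting interval; the field and metric names are illustrative, not taken from any monitoring system.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Hypothetical counters collected over one reporting interval."""
    generated_tokens: int       # tokens produced by actors
    consumed_tokens: int        # tokens consumed by learner updates
    learner_version: int        # latest version the learner has published
    oldest_queued_version: int  # oldest policy version still in the queue


def loop_metrics(c: IntervalCounters) -> dict[str, int]:
    """Derive backlog delta and worst-case version lag for one interval."""
    return {
        # Positive means the queue grew during the interval.
        "backlog_delta_tokens": c.generated_tokens - c.consumed_tokens,
        # Lag of the stalest trajectory still waiting for the learner.
        "max_version_lag": c.learner_version - c.oldest_queued_version,
    }


metrics = loop_metrics(
    IntervalCounters(
        generated_tokens=120_000,
        consumed_tokens=90_000,
        learner_version=9,
        oldest_queued_version=7,
    )
)
```

A backlog delta that stays positive across intervals is an early sign that version lag will grow, which is one reason to chart the two together.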
In a colocated design, generation and learning share accelerators or time slices. Weight refresh is simpler, but KV cache, activations, optimizer state, and kernels compete for memory and schedule time.
In a disaggregated design, actor pools and learner groups scale independently. This requires explicit weight publication, sample versioning, transfer accounting, and failure ownership.
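Explicit weight publication with transfer accounting can be sketched with an in-process stand-in for the publication channel. `WeightRegistry`, `Snapshot`, and their methods are hypothetical names for this example; a real system would move tensors over a network or shared storage, not pass `bytes` within one process.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    version: int
    weights: bytes  # stand-in for serialized model weights


class WeightRegistry:
    """Hypothetical in-process stand-in for a weight publication channel."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Optional[Snapshot] = None
        self.bytes_transferred = 0  # simple transfer accounting

    def publish(self, snapshot: Snapshot) -> None:
        """Called by the learner; versions must strictly increase."""
        with self._lock:
            if self._latest is not None and snapshot.version <= self._latest.version:
                raise ValueError("published versions must strictly increase")
            self._latest = snapshot

    def fetch_if_newer(self, current_version: int) -> Optional[Snapshot]:
        """Called by an actor; returns a snapshot only if it is newer."""
        with self._lock:
            latest = self._latest
            if latest is None or latest.version <= current_version:
                return None
            self.bytes_transferred += len(latest.weights)
            return latest


registry = WeightRegistry()
registry.publish(Snapshot(version=8, weights=b"abcd"))
refreshed = registry.fetch_if_newer(current_version=7)  # actor was on v7
unchanged = registry.fetch_if_newer(current_version=8)  # already current
```

Rejecting non-increasing versions at publish time keeps the version stamped on each trajectory unambiguous.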
A restartable run needs model state, optimizer state, scheduler/scaler state where applicable, policy version, data/rollout cursors, and the rule for accepting in-flight trajectories. Sharded training requires checkpoint APIs that understand distributed state and load-time resharding.
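The non-tensor part of that state can be sketched as a small manifest written next to the model and optimizer checkpoints. Everything here is an assumption for illustration: the `LoopState` fields, the JSON format, and the lag-based acceptance rule. Saving the sharded tensors themselves is left to whatever distributed checkpoint API the training stack provides.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class LoopState:
    """Hypothetical non-tensor state saved next to model/optimizer shards."""
    policy_version: int
    prompt_cursor: int    # index of the next prompt to dispatch
    max_version_lag: int  # rule for accepting in-flight trajectories
    model_path: str       # where model state was written
    optimizer_path: str   # where optimizer state was written


def save_manifest(state: LoopState, path: str) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp_path, path)


def load_manifest(path: str) -> LoopState:
    with open(path) as f:
        return LoopState(**json.load(f))


def accept_in_flight(state: LoopState, trajectory_version: int) -> bool:
    """After a restart, accept a trajectory only if its lag is within bounds."""
    lag = state.policy_version - trajectory_version
    return 0 <= lag <= state.max_version_lag


with tempfile.TemporaryDirectory() as d:
    manifest_path = os.path.join(d, "manifest.json")
    saved = LoopState(7, 1024, 2, "model.pt", "optim.pt")
    save_manifest(saved, manifest_path)
    restored = load_manifest(manifest_path)
```

Writing the manifest last, after the tensor checkpoints it points to, means a crash mid-save leaves the previous manifest intact and the run restarts from a consistent state.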