Serving Runtime

Multimodal Serving

Model an any-to-any request as typed stages with separate computation, memory ownership, stream semantics, and cancellation behavior.

Representative flow
prompt / audio / image
  -> encoder or AR thinker
  -> codec tokens or diffusion latents
  -> VAE / vocoder / decoder
  -> streamed audio, image, video, or text

Orchestration

Each stage needs an explicit service contract

Typed inputs and outputs

Autoregressive stages may emit incremental tokens; diffusion or VAE stages consume latents and usually complete bounded transformations. Connectors must state shape, dtype, ownership, and cancellation behavior.
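A connector contract of this kind could be written down as a small typed record. The following is a minimal sketch, not a definitive schema: `ConnectorSpec`, `Ownership`, `OnCancel`, and the example shape with 8 codebooks are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Ownership(Enum):
    PRODUCER = "producer"   # producer keeps the buffer; consumer must copy
    CONSUMER = "consumer"   # buffer is handed off; consumer frees it


class OnCancel(Enum):
    DROP = "drop"     # discard in-flight items immediately
    DRAIN = "drain"   # finish items already queued, then stop


@dataclass(frozen=True)
class ConnectorSpec:
    """Hypothetical contract for one edge between two stages."""
    name: str
    shape: Tuple[int, ...]   # -1 marks a dynamic dimension
    dtype: str
    ownership: Ownership
    on_cancel: OnCancel
    streaming: bool          # True if the producer emits incrementally

    def accepts(self, shape: Tuple[int, ...], dtype: str) -> bool:
        """Check a concrete payload against the declared shape and dtype."""
        if dtype != self.dtype or len(shape) != len(self.shape):
            return False
        return all(want in (-1, got) for want, got in zip(self.shape, shape))


# Assumed example: an AR stage streaming codec tokens to a vocoder.
codec_tokens = ConnectorSpec(
    name="thinker->vocoder",
    shape=(-1, 8),
    dtype="int32",
    ownership=Ownership.CONSUMER,
    on_cancel=OnCancel.DROP,
    streaming=True,
)
```

Checking payloads at the connector, for example `codec_tokens.accepts((120, 8), "int32")`, lets a shape or dtype mismatch surface at the stage boundary rather than deep inside the downstream stage.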

Bounded in-flight state

A slow decoder must not allow upstream stages to create unbounded intermediate tensors. Bound queues and measure depth, transfer bytes, first output latency, and drain behavior.
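One way to illustrate the bound is a fixed-size queue between a fast producer and a slow decoder. This is a sketch using Python's `asyncio.Queue`; `MeteredQueue`, `run_pipeline`, the 1 KiB payload, and the sleep duration are hypothetical stand-ins rather than part of any particular serving runtime.

```python
import asyncio


class MeteredQueue:
    """Bounded queue that records its depth high-water mark and bytes moved."""

    def __init__(self, maxsize: int) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.high_water = 0
        self.transfer_bytes = 0

    async def put(self, item: bytes) -> None:
        # Blocks while the queue is full, which is the backpressure signal.
        await self._queue.put(item)
        self.transfer_bytes += len(item)
        self.high_water = max(self.high_water, self._queue.qsize())

    async def get(self) -> bytes:
        return await self._queue.get()


async def run_pipeline(n_items: int = 20, maxsize: int = 4) -> MeteredQueue:
    queue = MeteredQueue(maxsize)

    async def producer() -> None:
        for _ in range(n_items):
            await queue.put(b"\x00" * 1024)  # stand-in for a latent tensor

    async def slow_decoder() -> None:
        for _ in range(n_items):
            await queue.get()
            await asyncio.sleep(0.001)  # simulated slow decode

    await asyncio.gather(producer(), slow_decoder())
    return queue


stats = asyncio.run(run_pipeline())
```

Because `put` suspends the producer when the queue is full, `stats.high_water` can never exceed `maxsize`, however slow the decoder is. The upstream stage waits instead of allocating more intermediates.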

| Stage type | Emission behavior | Queue concern | Primary measurement |
| --- | --- | --- | --- |
| Autoregressive generation | Emits incremental tokens or codec units. | Long-running active sequences need cancellation and bounded cache ownership. | First output latency and time per emitted unit. |
| Diffusion stage | Transforms latents over repeated internal steps before emission. | Large latent residency can block downstream capacity. | Stage latency, latent bytes, and queue depth. |
| VAE or output decoder | Turns final latents or codec state into delivered media. | Slow finalization must not accumulate unbounded intermediates. | Drain latency, peak buffers, and cancellation cleanup. |

Latency ledger

End-to-end latency = queue time + sum over stages of (stage compute + transfer). A single overall number hides the stage that should be scaled or repaired.
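The ledger can be kept as one record per stage, so the total and the dominant stage come from the same data. A minimal sketch follows; `StageTiming`, the stage names, and every number are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageTiming:
    name: str
    compute_ms: float
    transfer_ms: float


def end_to_end_ms(queue_ms: float, stages: List[StageTiming]) -> float:
    """Queue time plus the sum of (compute + transfer) over all stages."""
    return queue_ms + sum(s.compute_ms + s.transfer_ms for s in stages)


def dominant_stage(stages: List[StageTiming]) -> StageTiming:
    """Stage with the largest compute + transfer contribution."""
    return max(stages, key=lambda s: s.compute_ms + s.transfer_ms)


# Illustrative numbers only, not measurements.
ledger = [
    StageTiming("encoder", compute_ms=40.0, transfer_ms=2.0),
    StageTiming("diffusion", compute_ms=900.0, transfer_ms=15.0),
    StageTiming("vae_decode", compute_ms=120.0, transfer_ms=8.0),
]
total = end_to_end_ms(queue_ms=30.0, stages=ledger)
worst = dominant_stage(ledger)
```

With these made-up values the total is 1115 ms, of which the diffusion stage accounts for 915 ms. The single total alone would not show that.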

Parallelism

Place and parallelize the bottleneck stage

Large video or diffusion stages can use sequence or tensor parallel execution within a stage, while stage disaggregation scales heterogeneous components independently. Neither choice removes intermediate-transfer cost or the need for backpressure.
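For the disaggregated case, a rough capacity estimate shows why stages end up with different replica counts. The sketch below assumes a steady arrival rate and a fixed per-request service time per stage, and it deliberately ignores transfer cost, queueing, and headroom; the function names and all numbers are hypothetical.

```python
import math
from typing import Dict


def replicas_needed(rate_rps: float, service_s: Dict[str, float]) -> Dict[str, int]:
    """Minimum replicas per stage so capacity >= arrival rate (no headroom)."""
    return {name: max(1, math.ceil(rate_rps * s)) for name, s in service_s.items()}


def pipeline_capacity_rps(replicas: Dict[str, int], service_s: Dict[str, float]) -> float:
    """Throughput is capped by the stage with the lowest aggregate capacity."""
    return min(replicas[name] / service_s[name] for name in service_s)


# Illustrative per-request service times in seconds, not measurements.
service = {"encoder": 0.05, "diffusion": 1.0, "vae_decode": 0.25}
plan = replicas_needed(rate_rps=4.0, service_s=service)
capacity = pipeline_capacity_rps(plan, service)
```

Here the diffusion stage needs four replicas while the encoder and decoder need one each. A real plan would add the intermediate-transfer time to each stage and leave slack for bursts, since a stage running at exactly its capacity has no room to drain a backlog.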

Failures

Cancellation and partial output are first-class state

When a client cancels a streamed request, downstream emission, intermediate buffers, and upstream work must be released or drained deterministically. Measure per-stage errors and queue high-water marks to distinguish model failure, transfer failure, and overload.
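Deterministic release can be sketched with `asyncio` task cancellation, where each stage frees its buffer in a `finally` block. This is an illustration under simplifying assumptions: `stage`, `cancel_request`, the stage names, and the downstream-first cancellation order are choices made for the example, not a prescribed protocol.

```python
import asyncio
from typing import List


async def stage(name: str, released: List[str]) -> None:
    """Stand-in stage that holds a buffer until it finishes or is cancelled."""
    buffer = bytearray(1024)  # stand-in for an intermediate tensor
    try:
        await asyncio.sleep(3600)  # simulated long-running work
    finally:
        # Runs on normal completion and on cancellation alike.
        buffer.clear()
        released.append(name)


async def cancel_request() -> List[str]:
    released: List[str] = []
    names = ["encoder", "diffusion", "decoder"]
    tasks = [asyncio.create_task(stage(n, released)) for n in names]
    await asyncio.sleep(0)  # let every stage start and allocate

    # Client cancelled: stop downstream emission first, then upstream work.
    for task in reversed(tasks):
        task.cancel()
        # Wait for this stage's cleanup before cancelling the next one.
        await asyncio.gather(task, return_exceptions=True)
    return released


order = asyncio.run(cancel_request())
```

Awaiting each cancelled task before moving on makes the release order observable and repeatable, which is what allows cancellation cleanup to be measured per stage.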
