prompt / audio / image
-> encoder or AR thinker
-> codec tokens or diffusion latents
-> VAE / vocoder / decoder
-> streamed audio, image, video, or text

Multimodal Serving
Model an any-to-any request as a chain of typed stages, each with its own computation, memory ownership, stream semantics, and cancellation behavior.
Each stage needs an explicit service contract
Typed inputs and outputs
Autoregressive stages may emit incremental tokens; diffusion or VAE stages consume latents and usually complete bounded transformations. Connectors must state shape, dtype, ownership, and cancellation behavior.
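One way to make such a connector contract explicit is a small typed record per edge between stages. The sketch below is illustrative only: `ConnectorSpec`, `Ownership`, `OnCancel`, and `compatible` are hypothetical names, not part of any particular serving framework, and the convention that `-1` marks a dynamic dimension is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Ownership(Enum):
    PRODUCER = "producer"   # producer keeps the buffer; consumer must copy
    CONSUMER = "consumer"   # buffer is handed off; consumer releases it


class OnCancel(Enum):
    DROP = "drop"     # discard the in-flight item immediately
    DRAIN = "drain"   # finish the current item, then stop


@dataclass(frozen=True)
class ConnectorSpec:
    """Hypothetical contract for one edge between two stages."""
    name: str
    shape: Tuple[int, ...]   # assumption: -1 marks a dynamic dimension
    dtype: str               # e.g. "float16" or "int32"
    ownership: Ownership
    on_cancel: OnCancel
    incremental: bool        # True if the producer streams partial output


def compatible(produced: ConnectorSpec, expected: ConnectorSpec) -> bool:
    """Return True if the producer's output fits what the consumer expects."""
    if produced.dtype != expected.dtype:
        return False
    if len(produced.shape) != len(expected.shape):
        return False
    # A consumer dimension of -1 accepts any producer dimension.
    return all(p == e or e == -1 for p, e in zip(produced.shape, expected.shape))
```

Checking `compatible` when the pipeline is wired up, rather than at the first request, surfaces shape or dtype mismatches before any tensor is allocated.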
Bounded in-flight state
A slow decoder must not allow upstream stages to create unbounded intermediate tensors. Bound queues and measure depth, transfer bytes, first output latency, and drain behavior.
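A minimal sketch of this kind of backpressure, assuming the stages run as `asyncio` coroutines connected by a bounded `asyncio.Queue`. The function names and the `stats` dictionary are hypothetical, and the integers stand in for intermediate tensors.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_items: int, stats: dict) -> None:
    """Upstream stage: put() suspends while the queue is full."""
    for i in range(n_items):
        await queue.put(i)  # backpressure: waits for a free slot
        stats["high_water"] = max(stats["high_water"], queue.qsize())
    await queue.put(None)  # sentinel marking end of stream


async def slow_consumer(queue: asyncio.Queue, stats: dict) -> None:
    """Downstream stage that is slower than the producer."""
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # stand-in for decode work
        stats["consumed"] += 1


async def run_pipeline(n_items: int = 20, max_depth: int = 4) -> dict:
    stats = {"high_water": 0, "consumed": 0}
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
    await asyncio.gather(
        producer(queue, n_items, stats),
        slow_consumer(queue, stats),
    )
    return stats
```

Because the queue is bounded, the recorded high-water mark cannot exceed `max_depth`, however slow the consumer is. The producer stalls instead of allocating more intermediates.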
| Stage type | Emission behavior | Queue concern | Primary measurement |
|---|---|---|---|
| Autoregressive generation | Emits incremental tokens or codec units. | Long-running active sequences need cancellation and bounded cache ownership. | First output latency and time per emitted unit. |
| Diffusion stage | Transforms latents over repeated internal steps before emission. | Large latent residency can block downstream capacity. | Stage latency, latent bytes, and queue depth. |
| VAE or output decoder | Turns final latents or codec state into delivered media. | Slow finalization must not accumulate unbounded intermediates. | Drain latency, peak buffers, and cancellation cleanup. |
End-to-end latency = queue time + the sum over stages of (stage compute + transfer). A single overall number hides which stage should be scaled or repaired.
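The decomposition can be sketched as a per-stage record plus two helpers. `StageTiming`, `end_to_end_ms`, and `slowest_stage` are hypothetical names, and the timings in the usage lines are made-up illustrative values, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageTiming:
    name: str
    compute_ms: float
    transfer_ms: float


def end_to_end_ms(queue_ms: float, stages: List[StageTiming]) -> float:
    """Queue time plus the sum of (compute + transfer) over all stages."""
    return queue_ms + sum(s.compute_ms + s.transfer_ms for s in stages)


def slowest_stage(stages: List[StageTiming]) -> StageTiming:
    """The stage with the largest compute + transfer contribution."""
    return max(stages, key=lambda s: s.compute_ms + s.transfer_ms)


# Illustrative values only.
timings = [
    StageTiming("thinker", 120.0, 5.0),
    StageTiming("diffusion", 900.0, 40.0),
    StageTiming("vae", 150.0, 10.0),
]
total = end_to_end_ms(30.0, timings)      # 30 + 125 + 940 + 160 = 1255.0
bottleneck = slowest_stage(timings).name  # "diffusion"
```

Reporting the per-stage terms alongside the total shows which stage dominates, which the single 1255 ms figure does not.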
Place and parallelize the bottleneck stage
Large video or diffusion stages can use sequence or tensor parallel execution within a stage, while stage disaggregation scales heterogeneous components independently. Neither choice removes intermediate-transfer cost or the need for backpressure.
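To see why transfer cost persists after disaggregation, a back-of-the-envelope estimate of the bytes moved between stages is often enough. The sketch below assumes a dense tensor and a simple model of fixed overhead plus bytes divided by bandwidth. The shape, dtype table, and bandwidth are hypothetical inputs, not properties of any specific model or interconnect.

```python
import math
from typing import Tuple

# Bytes per element for a few common dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def tensor_bytes(shape: Tuple[int, ...], dtype: str) -> int:
    """Size in bytes of a dense tensor with the given shape and dtype."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def transfer_seconds(n_bytes: int, bandwidth_bytes_per_s: float,
                     fixed_overhead_s: float = 0.0) -> float:
    """Simple model: fixed overhead plus bytes divided by bandwidth."""
    return fixed_overhead_s + n_bytes / bandwidth_bytes_per_s


# Hypothetical video latent: (batch, channels, frames, height, width).
latent_bytes = tensor_bytes((1, 16, 32, 64, 64), "float16")  # 4194304 bytes
seconds = transfer_seconds(latent_bytes, bandwidth_bytes_per_s=1e9)
```

Multiplying the per-request figure by the number of in-flight requests gives a rough bound on the bytes a bounded queue between the two stages may hold.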
Cancellation and partial output are first-class state
When a client cancels a streamed request, downstream emission, intermediate buffers, and upstream work must be released or drained deterministically. Measure per-stage errors and queue high-water marks to distinguish model failure, transfer failure, and overload.
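A minimal sketch of deterministic release on cancellation, assuming each stage is an `asyncio` task and cleanup lives in a `finally` block. The stage functions and the `released` list are hypothetical stand-ins for freeing KV cache, latents, or output buffers.

```python
import asyncio
from typing import List


async def generate(out: asyncio.Queue, released: List[str]) -> None:
    """Upstream stage: emits units until cancelled."""
    try:
        i = 0
        while True:
            await out.put(i)  # bounded queue also applies backpressure
            i += 1
    finally:
        released.append("generate")  # stand-in for freeing stage state


async def decode(inp: asyncio.Queue, emitted: List[int],
                 released: List[str]) -> None:
    """Downstream stage: turns units into delivered output."""
    try:
        while True:
            unit = await inp.get()
            await asyncio.sleep(0.001)  # stand-in for decode work
            emitted.append(unit)
    finally:
        released.append("decode")


async def run_and_cancel(cancel_after_s: float = 0.02) -> dict:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    emitted: List[int] = []
    released: List[str] = []
    tasks = [
        asyncio.create_task(generate(queue, released)),
        asyncio.create_task(decode(queue, emitted, released)),
    ]
    await asyncio.sleep(cancel_after_s)  # the client cancels here
    for task in tasks:
        task.cancel()
    # Wait until every stage has actually run its cleanup.
    await asyncio.gather(*tasks, return_exceptions=True)
    dropped = 0
    while not queue.empty():  # drain intermediates left in the connector
        queue.get_nowait()
        dropped += 1
    return {"emitted": len(emitted), "released": sorted(released),
            "dropped": dropped, "queue_empty": queue.empty()}
```

Awaiting the cancelled tasks before draining the queue matters here: only after `gather` returns is it certain that no stage will enqueue another intermediate.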
