InfraLens - Multimodal Serving

Serving Runtime

Multimodal Serving

Model an any-to-any request as a pipeline of typed stages, each with its own computation, memory ownership, stream semantics, and cancellation behavior.

Representative flow

prompt / audio / image
  -> encoder or AR thinker
  -> codec tokens or diffusion latents
  -> VAE / vocoder / decoder
  -> streamed audio, image, video, or text

Orchestration

Each stage needs an explicit service contract

Typed inputs and outputs

Autoregressive stages may emit incremental tokens; diffusion or VAE stages consume latents and usually run a bounded transformation to completion before emitting anything. Every connector between stages must state shape, dtype, ownership, and cancellation behavior.
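One way to make such a contract concrete is a small, immutable record that each connector publishes and each consumer checks. The sketch below is illustrative only: `ConnectorSpec`, `Ownership`, `OnCancel`, the connector name, and the latent shape are hypothetical names and values chosen for this example, not part of any particular serving framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Ownership(Enum):
    PRODUCER = "producer"  # producer keeps the buffer; consumer must copy
    CONSUMER = "consumer"  # buffer is handed off; consumer frees it


class OnCancel(Enum):
    DROP = "drop"    # discard in-flight items immediately
    DRAIN = "drain"  # finish items already queued, then stop


@dataclass(frozen=True)
class ConnectorSpec:
    """Hypothetical contract for the link between two stages."""
    name: str
    shape: Tuple[int, ...]  # -1 marks a dynamic dimension
    dtype: str
    ownership: Ownership
    on_cancel: OnCancel

    def accepts(self, shape: Tuple[int, ...], dtype: str) -> bool:
        """Return True if a tensor with this shape and dtype fits the contract."""
        if dtype != self.dtype or len(shape) != len(self.shape):
            return False
        return all(want in (-1, got) for want, got in zip(self.shape, shape))


# Illustrative values only: a batch-dynamic float16 latent handed to the decoder.
latents = ConnectorSpec(
    name="diffusion_to_vae",
    shape=(-1, 4, 64, 64),
    dtype="float16",
    ownership=Ownership.CONSUMER,
    on_cancel=OnCancel.DROP,
)
```

Checking `latents.accepts(tensor_shape, tensor_dtype)` at the stage boundary turns a silent shape or dtype mismatch into an explicit contract violation.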

Bounded in-flight state

A slow decoder must not allow upstream stages to create unbounded intermediate tensors. Bound queues and measure depth, transfer bytes, first output latency, and drain behavior.
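A minimal sketch of this kind of backpressure, assuming a single producer and a single slow consumer, can be built on Python's `asyncio.Queue` with a `maxsize`. The function name `run_pipeline`, the item count, and the simulated decoder delay are assumptions made for illustration; a real connector would carry tensors and record transfer bytes as well.

```python
import asyncio


async def run_pipeline(n_items: int, max_depth: int) -> dict:
    """Push n_items through a bounded queue and report queue statistics."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
    stats = {"high_water": 0, "delivered": 0}

    async def producer() -> None:
        for i in range(n_items):
            await queue.put(i)  # suspends while the queue is full
            stats["high_water"] = max(stats["high_water"], queue.qsize())
        await queue.put(None)   # sentinel: no more items

    async def consumer() -> None:
        while True:
            item = await queue.get()
            if item is None:
                break
            await asyncio.sleep(0.001)  # stand-in for a slow decoder
            stats["delivered"] += 1

    await asyncio.gather(producer(), consumer())
    return stats


stats = asyncio.run(run_pipeline(n_items=20, max_depth=4))
```

Because `put` suspends the producer when the queue is full, the high-water mark can never exceed `max_depth`, however slow the consumer is.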

Stage type	Emission behavior	Queue concern	Primary measurement
Autoregressive generation	Emits incremental tokens or codec units.	Long-running active sequences need cancellation and bounded cache ownership.	First output latency and time per emitted unit.
Diffusion stage	Transforms latents over repeated internal steps before emission.	Large latent residency can block downstream capacity.	Stage latency, latent bytes, and queue depth.
VAE or output decoder	Turns final latents or codec state into delivered media.	Slow finalization must not accumulate unbounded intermediates.	Drain latency, peak buffers, and cancellation cleanup.

Latency ledger

End-to-end latency = queue time + sum over stages of (stage compute + transfer). A single overall number hides which stage should be scaled or repaired.
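The ledger can be sketched as one record per stage. This example assumes queue time is also tracked per stage and summed, which is one reading of the formula above; `StageTiming` and the millisecond values are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageTiming:
    """Hypothetical per-stage latency record, all values in milliseconds."""
    name: str
    queue_ms: float
    compute_ms: float
    transfer_ms: float

    @property
    def total_ms(self) -> float:
        return self.queue_ms + self.compute_ms + self.transfer_ms


def end_to_end_ms(stages: List[StageTiming]) -> float:
    """Sum queue, compute, and transfer time across all stages."""
    return sum(s.total_ms for s in stages)


def slowest(stages: List[StageTiming]) -> StageTiming:
    """Return the stage contributing the most latency."""
    return max(stages, key=lambda s: s.total_ms)


# Made-up numbers for a three-stage request.
ledger = [
    StageTiming("encoder", queue_ms=2.0, compute_ms=15.0, transfer_ms=1.0),
    StageTiming("diffusion", queue_ms=40.0, compute_ms=300.0, transfer_ms=8.0),
    StageTiming("vae", queue_ms=5.0, compute_ms=60.0, transfer_ms=3.0),
]
total = end_to_end_ms(ledger)
bottleneck = slowest(ledger)
```

With these numbers the total alone says little, while the per-stage rows point directly at the diffusion stage and separate its queue time from its compute time.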

Parallelism

Place and parallelize the bottleneck stage

Large video or diffusion stages can use sequence or tensor parallel execution within a stage, while stage disaggregation scales heterogeneous components independently. Neither choice removes intermediate-transfer cost or the need for backpressure.
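A back-of-the-envelope estimate shows why the transfer term stays in the budget after disaggregation. The helper names, the latent shape, and the 25 GB/s link bandwidth below are assumptions picked for the arithmetic, not properties of any specific model or interconnect.

```python
import math
from typing import Tuple

DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}


def tensor_bytes(shape: Tuple[int, ...], dtype: str) -> int:
    """Size of a dense tensor in bytes."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def transfer_ms(num_bytes: int, gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring link latency and contention."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


# Assumed video latent: (batch, channels, frames, height, width) in float16.
latent_shape = (1, 16, 21, 90, 160)
size = tensor_bytes(latent_shape, "float16")
cost_ms = transfer_ms(size, gb_per_s=25.0)
```

Under these assumptions one handoff moves about 9.7 MB and costs roughly 0.39 ms at best; the cost scales with batch size and with every additional stage boundary, so it belongs in the per-stage ledger rather than being treated as free.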

Failures

Cancellation and partial output are first-class state

When a client cancels a streamed request, downstream emission, intermediate buffers, and upstream work must be released or drained deterministically. Measure per-stage errors and queue high-water marks to distinguish model failure, transfer failure, and overload.
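A minimal sketch of deterministic cleanup, assuming an `asyncio` producer that hands buffers to a bounded queue: ownership moves to the queue only when `put` succeeds, so a cancelled `put` releases its own buffer, and the cancel path then drains whatever was already queued. `BufferPool`, `generate`, and `cancel_request` are hypothetical names for this example.

```python
import asyncio


class BufferPool:
    """Counts live intermediate buffers; a stand-in for real device memory."""

    def __init__(self) -> None:
        self.live = 0

    def acquire(self) -> None:
        self.live += 1

    def release(self) -> None:
        self.live -= 1


async def generate(pool: BufferPool, queue: asyncio.Queue) -> None:
    """Upstream stage: allocate a buffer per chunk and hand it downstream."""
    while True:
        pool.acquire()
        try:
            await queue.put("chunk")  # may suspend on backpressure
        except asyncio.CancelledError:
            pool.release()  # the chunk was never enqueued, so free it here
            raise


async def cancel_request(task: asyncio.Task, pool: BufferPool, queue: asyncio.Queue) -> None:
    """Stop upstream work, then release every buffer still queued."""
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    while not queue.empty():
        queue.get_nowait()
        pool.release()


async def demo() -> tuple:
    pool = BufferPool()
    queue: asyncio.Queue = asyncio.Queue(maxsize=3)
    task = asyncio.create_task(generate(pool, queue))
    await asyncio.sleep(0.01)  # let the producer fill the queue and block
    peak = pool.live
    await cancel_request(task, pool, queue)
    return peak, pool.live


peak, remaining = asyncio.run(demo())
```

In this run the producer holds three queued buffers plus one blocked in `put` when the client cancels, and the live count returns to zero afterwards. Asserting that the count reaches zero after every cancellation is a cheap way to catch leaked intermediates.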
