Multimodal Serving Labs

Trace stages, transfers, and stream pressure

These labs treat intermediate data ownership and bounded queues as part of the model-serving design.

Lab 01 · Runnable

Multistage Event Trace

Execute a bounded autoregressive -> diffusion -> VAE decode flow and inspect rejected work when admission exceeds stage capacity.

Open Python starter · Pipeline context
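
To make the admission behavior concrete, here is a minimal sketch of a bounded three-stage trace. It is not the lab's starter code. The names `Stage`, `run_trace`, and `admit_per_tick`, the one-item-per-tick service rate, and the reject-on-full admission policy are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

STAGES = ["autoregressive", "diffusion", "vae_decode"]


@dataclass
class Stage:
    name: str
    capacity: int  # hypothetical bound on items held by this stage
    queue: deque = field(default_factory=deque)


def run_trace(num_requests, capacities, admit_per_tick=2):
    """Simulate admission into a bounded pipeline and record every event."""
    stages = [Stage(n, c) for n, c in zip(STAGES, capacities)]
    events, completed, rejected = [], [], []
    pending = deque(range(num_requests))
    tick = 0
    while pending or any(s.queue for s in stages):
        # Walk from the last stage backwards so an item advances
        # at most one stage per tick.
        for i in reversed(range(len(stages))):
            stage = stages[i]
            if not stage.queue:
                continue
            if i == len(stages) - 1:
                req = stage.queue.popleft()
                completed.append(req)
                events.append((tick, req, stage.name, "done"))
            elif len(stages[i + 1].queue) < stages[i + 1].capacity:
                req = stage.queue.popleft()
                stages[i + 1].queue.append(req)
                events.append((tick, req, stages[i + 1].name, "enter"))
            # Otherwise the downstream stage is full and the item waits here.
        # Admission: reject new work when the first stage is at capacity.
        for _ in range(min(admit_per_tick, len(pending))):
            req = pending.popleft()
            first = stages[0]
            if len(first.queue) < first.capacity:
                first.queue.append(req)
                events.append((tick, req, first.name, "enter"))
            else:
                rejected.append(req)
                events.append((tick, req, first.name, "rejected"))
        tick += 1
    return events, completed, rejected


if __name__ == "__main__":
    events, completed, rejected = run_trace(6, capacities=(1, 1, 1))
    for event in events:
        print(event)
    print("completed:", completed)  # completed: [0, 2, 4]
    print("rejected:", rejected)    # rejected: [1, 3, 5]
```

With two arrivals per tick and a first-stage capacity of one, every second request is rejected at admission. Raising the first capacity or lowering `admit_per_tick` shows how the rejection count tracks the gap between arrival rate and stage capacity.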

Lab 02 · Design worksheet

Video / Diffusion Parallel Plan

Start with latent token count and denoising-stage memory. Decide whether stage placement, sequence parallel execution, or bounded concurrency addresses the measured bottleneck.

Use the Video Token Count Estimator before committing to a parallel layout.
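
As a rough illustration of the first worksheet step, the sketch below estimates a latent token count and the size of one hidden-state tensor. It is not the formula used by the Video Token Count Estimator. The function names and the default strides (temporal 4, spatial 8, patch 2) are assumptions; substitute the values for the VAE and denoiser actually being served.

```python
import math


def latent_token_count(frames, height, width, *,
                       temporal_stride=4, spatial_stride=8, patch=2):
    """Estimate tokens seen by the denoiser for one video.

    Stride and patch defaults are illustrative assumptions,
    not values taken from any particular model.
    """
    latent_frames = math.ceil(frames / temporal_stride)
    latent_h = math.ceil(height / spatial_stride)
    latent_w = math.ceil(width / spatial_stride)
    tokens_per_frame = math.ceil(latent_h / patch) * math.ceil(latent_w / patch)
    return latent_frames * tokens_per_frame


def hidden_state_bytes(tokens, hidden_dim, bytes_per_elem=2):
    """Size of one [tokens, hidden_dim] activation tensor (2 bytes = fp16/bf16)."""
    return tokens * hidden_dim * bytes_per_elem


if __name__ == "__main__":
    tokens = latent_token_count(frames=80, height=480, width=832)
    print(tokens)                            # 31200
    print(hidden_state_bytes(tokens, 4096))  # 255590400
```

A count like this only sizes a single activation tensor per request. Attention workspace, weights, and the number of concurrent requests still have to be measured before choosing between stage placement, sequence parallel execution, and bounded concurrency.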

Lab 03 · Streaming

Chunk Scheduler and Cancellation

Describe the first emitted chunk, queue capacity, cancellation propagation, and tensor cleanup for an audio or video output stage. Then enumerate the per-stage metrics required to confirm the behavior.
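
One way to reason about these four items together is a small `asyncio` sketch: a producer stage writes chunks into a bounded queue, the consumer cancels after a few chunks, and the producer releases its buffers in a `finally` block. The names `produce_chunks` and `run_stream`, the metric keys, and the `bytearray` stand-in for tensors are assumptions for illustration, not part of any serving framework.

```python
import asyncio
import time


async def produce_chunks(queue, metrics, total_chunks):
    """Output stage: emit chunks into a bounded queue, clean up on cancel."""
    buffers = []  # stand-in for tensors owned by this stage
    try:
        for i in range(total_chunks):
            chunk = bytearray(1024)
            buffers.append(chunk)
            await queue.put((i, chunk))  # suspends while the queue is full
            metrics["produced"] += 1
            metrics["max_queue_depth"] = max(
                metrics["max_queue_depth"], queue.qsize())
    except asyncio.CancelledError:
        metrics["cancelled"] = True
        raise
    finally:
        buffers.clear()  # runs on both normal exit and cancellation
        metrics["buffers_released"] = True


async def run_stream(total_chunks, queue_capacity, stop_after):
    queue = asyncio.Queue(maxsize=queue_capacity)
    metrics = {"produced": 0, "consumed": 0, "max_queue_depth": 0,
               "dropped_after_cancel": 0, "cancelled": False,
               "buffers_released": False, "first_chunk_s": None}
    start = time.perf_counter()
    producer = asyncio.create_task(
        produce_chunks(queue, metrics, total_chunks))
    for _ in range(min(stop_after, total_chunks)):
        await queue.get()
        if metrics["first_chunk_s"] is None:
            metrics["first_chunk_s"] = time.perf_counter() - start
        metrics["consumed"] += 1
    # Client went away: propagate cancellation upstream and wait for cleanup.
    producer.cancel()
    try:
        await producer
    except asyncio.CancelledError:
        pass
    # Chunks still queued after cancellation are dropped, not delivered.
    while not queue.empty():
        queue.get_nowait()
        metrics["dropped_after_cancel"] += 1
    return metrics


if __name__ == "__main__":
    print(asyncio.run(run_stream(total_chunks=100, queue_capacity=4,
                                 stop_after=3)))
```

The metrics dictionary doubles as a starting list for the per-stage measurements: time to first chunk, peak queue depth, chunks produced versus consumed, chunks dropped after cancellation, and whether buffers were released. Because the queue is bounded, `produced` can never exceed `consumed` plus the queue capacity.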

Reference implementation boundary: vLLM-Omni disaggregated inference docs.