Multimodal Serving Interview Practice

Stage Orchestration

Explain a heterogeneous request in terms of stage ownership, queueing, transfer, streaming, and failure handling.

Q&A Cards

01 Why is multimodal generation serving more than LLM token decoding?

Short Answer

A request may pass through encoders, autoregressive stages, diffusion or flow generation, VAE decode, and audio decoding. These stages expose different batching, memory, and partial-output behavior.

02 What does stage disaggregation buy, and what does it cost?

Short Answer

It allows each stage to scale and place resources independently, but requires typed transfers, connector reliability, queue bounds, cancellation propagation, and observability across boundaries.

Source: vLLM-Omni disaggregated inference
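To make "typed transfers" and "cancellation propagation" concrete, here is a minimal sketch of a stage boundary check. It is not vLLM-Omni's API: `TransferEnvelope`, `StageInbox`, and every field name are hypothetical and chosen only for illustration. The idea is that a payload crossing a boundary carries enough metadata for the consumer to reject a mismatched tensor early, and that a cancelled request is dropped at the boundary instead of consuming downstream compute.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransferEnvelope:
    """Hypothetical metadata attached to a payload crossing a stage boundary."""
    request_id: str
    consumer_stage: str
    dtype: str
    shape: tuple


@dataclass
class StageInbox:
    """Hypothetical receiving side of one stage."""
    stage: str
    expected_dtype: str
    expected_rank: int
    cancelled: set = field(default_factory=set)

    def cancel(self, request_id: str) -> None:
        # Cancellation must reach every stage that may still hold the request.
        self.cancelled.add(request_id)

    def accept(self, env: TransferEnvelope) -> bool:
        """Return True if the payload should be processed, False if dropped."""
        if env.request_id in self.cancelled:
            return False  # drop silently: the client no longer wants the output
        if env.consumer_stage != self.stage:
            raise ValueError(f"misrouted transfer for {env.consumer_stage!r}")
        if env.dtype != self.expected_dtype or len(env.shape) != self.expected_rank:
            raise TypeError(f"unexpected payload {env.dtype} {env.shape}")
        return True


inbox = StageInbox(stage="vae_decode", expected_dtype="float16", expected_rank=4)
good = TransferEnvelope("req-1", "vae_decode", "float16", (1, 4, 64, 64))
accepted = inbox.accept(good)

inbox.cancel("req-1")
accepted_after_cancel = inbox.accept(good)
```

In an interview answer, the point to draw out is that both checks are costs of disaggregation: a single-process pipeline gets type agreement and cancellation almost for free, while separate stages have to carry them explicitly.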

03 How would you debug high first-chunk latency?

Short Answer

Break first-chunk latency into per-stage queue, compute, and transfer times, then determine whether the first streaming stage is blocked on encoding, AR generation, transfer, diffusion work, or output decode.

Common Mistake

Optimizing a later-stage throughput metric without identifying the stage delaying first output.
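The breakdown described in this card can be sketched as a small attribution helper. The sketch assumes each stage on the critical path to the first chunk reports three timings and that those stages run sequentially. `StageSpan`, `first_chunk_breakdown`, the stage names, and the numbers are all hypothetical sample data, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpan:
    """Hypothetical per-stage timing for the work that precedes the first chunk."""
    stage: str
    queue_ms: float
    compute_ms: float
    transfer_ms: float


def first_chunk_breakdown(spans):
    """Return (total_ms, (stage, component, ms)) for the largest single contributor."""
    total = 0.0
    worst = ("", "", 0.0)
    for span in spans:
        components = (
            ("queue", span.queue_ms),
            ("compute", span.compute_ms),
            ("transfer", span.transfer_ms),
        )
        for component, ms in components:
            total += ms
            if ms > worst[2]:
                worst = (span.stage, component, ms)
    return total, worst


# Made-up sample: the AR stage spends more time queued than computing.
spans = [
    StageSpan("encoder", queue_ms=4.0, compute_ms=18.0, transfer_ms=2.0),
    StageSpan("ar", queue_ms=120.0, compute_ms=95.0, transfer_ms=6.0),
    StageSpan("audio_decode", queue_ms=3.0, compute_ms=22.0, transfer_ms=0.0),
]
total_ms, worst = first_chunk_breakdown(spans)
print(f"first chunk after {total_ms} ms; largest contributor: {worst}")
```

With this sample data the largest contributor is queueing in front of the AR stage, so tuning the throughput of a later stage would not move first-chunk latency, which is the mistake the card warns about.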

04 How do slow stages cause memory failures?

Short Answer

Without bounded queues, upstream stages keep producing intermediate tensors faster than a slow output stage can drain them, so memory grows until something fails. Bound admissions, and track queue depth plus peak active memory per stage.
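A bounded queue is the simplest way to show the backpressure this answer relies on. The sketch below uses the standard library's `asyncio.Queue` with `maxsize` as a stand-in for whatever connector sits between stages. `run_pipeline`, the item counts, and the sleep duration are illustrative assumptions. The fast producer suspends on `put` once the queue is full, so the number of buffered intermediates never exceeds the bound.

```python
import asyncio


async def run_pipeline(num_items: int, max_depth: int) -> int:
    """Run a fast producer against a slow consumer; return the peak queue depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
    peak_depth = 0

    async def producer() -> None:
        nonlocal peak_depth
        for item in range(num_items):
            await queue.put(item)  # suspends here while the queue is full
            peak_depth = max(peak_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more work

    async def consumer() -> int:
        drained = 0
        while True:
            item = await queue.get()
            if item is None:
                return drained
            await asyncio.sleep(0.001)  # stand-in for a slow output stage
            drained += 1

    _, drained = await asyncio.gather(producer(), consumer())
    assert drained == num_items
    return peak_depth


peak = asyncio.run(run_pipeline(num_items=20, max_depth=4))
print(f"peak queue depth: {peak}")
```

Queue depth bounds only the items waiting between stages. Tensors held by in-flight work inside each stage still count toward memory, which is why the answer also asks for peak active memory per stage.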