Multimodal Serving Interview Practice
Explain a heterogeneous request as stage ownership, queueing, transfer, streaming, and failure handling.
Q&A Cards
01 Why is multimodal generation serving more than LLM token decoding?
Short Answer
A request may pass through encoders, autoregressive stages, diffusion or flow generation, VAE decode, and audio decoding. These stages expose different batching, memory, and partial-output behavior.
02 What does stage disaggregation buy, and what does it cost?
Short Answer
It lets each stage scale and be placed on resources independently, but it requires typed transfers, connector reliability, queue bounds, cancellation propagation, and observability across stage boundaries.
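One way to make "typed transfers" concrete is an envelope that a receiving stage checks before touching the payload. The following is a minimal sketch, not any particular framework's API: `TransferEnvelope`, `DTYPE_BYTES`, `validate`, and the kind names are all hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple

# Hypothetical dtype table; a real system would use its tensor library's dtypes.
DTYPE_BYTES = {"float16": 2, "float32": 4, "int32": 4}


@dataclass(frozen=True)
class TransferEnvelope:
    """Metadata that travels with a tensor handed from one stage to the next."""
    request_id: str
    producer_stage: str
    kind: str                # e.g. "latent" or "token_ids" (illustrative)
    dtype: str
    shape: Tuple[int, ...]
    payload: bytes


def validate(envelope: TransferEnvelope, expected_kind: str) -> None:
    """Reject a transfer whose kind or byte size does not match its metadata."""
    if envelope.kind != expected_kind:
        raise TypeError(
            f"{envelope.request_id}: expected {expected_kind!r}, "
            f"got {envelope.kind!r} from {envelope.producer_stage}"
        )
    if envelope.dtype not in DTYPE_BYTES:
        raise ValueError(f"unknown dtype {envelope.dtype!r}")
    expected_bytes = math.prod(envelope.shape) * DTYPE_BYTES[envelope.dtype]
    if len(envelope.payload) != expected_bytes:
        raise ValueError(
            f"{envelope.request_id}: payload is {len(envelope.payload)} bytes, "
            f"expected {expected_bytes}"
        )


# Usage: a 2x4 float16 latent is 16 bytes.
latent = TransferEnvelope("req-1", "diffusion", "latent", "float16", (2, 4), bytes(16))
validate(latent, expected_kind="latent")
```

Checking at the boundary means a truncated or mislabeled transfer fails with the request ID and producer stage attached, rather than surfacing later as a shape error inside the consuming stage.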
03 How would you debug high first-chunk latency?
Short Answer
Break latency into per-stage queue, compute, and transfer times; find whether the first streaming stage is blocked by encoding, AR generation, transfer, diffusion work, or output decode.
Common Mistake
Optimizing a later-stage throughput metric without identifying the stage delaying first output.
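The breakdown described in this card can be sketched as a small helper. This assumes each stage already records queue, compute, and transfer time for the first chunk; `StageTiming`, `first_chunk_breakdown`, and the numbers are hypothetical and only illustrate the shape of the analysis.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StageTiming:
    """Time one stage spent on the first output chunk, in milliseconds."""
    stage: str
    queue_ms: float
    compute_ms: float
    transfer_ms: float

    @property
    def total_ms(self) -> float:
        return self.queue_ms + self.compute_ms + self.transfer_ms


def first_chunk_breakdown(timings: List[StageTiming]) -> Dict[str, object]:
    """Find the stage, and the component within it, that delays first output most."""
    if not timings:
        raise ValueError("no stage timings recorded")
    worst = max(timings, key=lambda t: t.total_ms)
    components = {
        "queue": worst.queue_ms,
        "compute": worst.compute_ms,
        "transfer": worst.transfer_ms,
    }
    return {
        "total_ms": sum(t.total_ms for t in timings),
        "worst_stage": worst.stage,
        "dominant_component": max(components, key=components.get),
    }


# Made-up timings for a single request, for illustration only.
timings = [
    StageTiming("encoder", queue_ms=2, compute_ms=15, transfer_ms=1),
    StageTiming("ar", queue_ms=40, compute_ms=120, transfer_ms=3),
    StageTiming("diffusion", queue_ms=300, compute_ms=90, transfer_ms=5),
    StageTiming("vae_decode", queue_ms=1, compute_ms=20, transfer_ms=0),
]
report = first_chunk_breakdown(timings)
```

With these made-up numbers the diffusion stage dominates, and most of its time is queueing rather than compute, so adding diffusion capacity or tightening admission would help more than speeding up a later decode stage.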
04 How do slow stages cause memory failures?
Short Answer
Without bounded queues, upstream stages keep producing intermediate tensors faster than a slow output stage can drain them, so memory grows until allocation fails. Bound admissions and track queue depth plus peak active memory per stage.
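A bounded queue between stages can be sketched with `asyncio.Queue(maxsize=...)` from the Python standard library. This is a toy model under stated assumptions: `run_pipeline`, the integer "tensors", and the sleep that stands in for a slow decode are all illustrative, not a real serving stack.

```python
import asyncio
from typing import List, Tuple


async def run_pipeline(
    num_items: int, max_depth: int, drain_delay_s: float = 0.001
) -> Tuple[int, List[int]]:
    """Feed a slow output stage through a bounded queue; report peak queue depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
    peak_depth = 0
    consumed: List[int] = []

    async def upstream() -> None:
        nonlocal peak_depth
        for item in range(num_items):
            # put() suspends the producer while the queue is full, which is
            # the backpressure that stops intermediates from piling up.
            await queue.put(item)
            peak_depth = max(peak_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more work

    async def slow_output_stage() -> None:
        while True:
            item = await queue.get()
            if item is None:
                break
            await asyncio.sleep(drain_delay_s)  # stand-in for slow decode
            consumed.append(item)

    await asyncio.gather(upstream(), slow_output_stage())
    return peak_depth, consumed


peak_depth, consumed = asyncio.run(run_pipeline(num_items=10, max_depth=2))
```

Here `peak_depth` never exceeds `max_depth`, however slow the output stage is. The item the output stage is currently processing is held outside the queue, so the per-stage memory budget should cover `max_depth + 1` intermediates.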
