Lab 03: Latent Shape Tracking
Annotated code reading lab. Running code is optional.
This lab maps directly to the handbook section. Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.
Track latent shape through batch, channel, frame, height and width axes.
Mechanism to keep in mind
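- `frames` is its own semantic (time) axis; it is not an extension of the batch axis.
- `scale_factor` (named `scale` in the starter file) shrinks the spatial dimensions: height and width are each divided by it.
- `latent_shape` is the shape of the tensor that the denoiser and scheduler update; they change its values, and the shape normally stays fixed across steps.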
image/video input or noise
-> VAE latent space
-> latent tensor keeps batch, channel, frame/time, height, width axes
-> denoiser and scheduler update latent values
-> VAE decode maps latent back to pixels/frames
Implementation-dependent: the exact video latent rank and channel count vary by model family, and some video VAEs also compress the frame axis. Use this as a shape-tracking mental model, and verify it against the current DiffSynth source for a specific pipeline.
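The flow above can be sketched as pure shape bookkeeping, with no tensors involved. This is a minimal sketch under stated assumptions, not DiffSynth code: `VideoShape`, `encode_shape`, `denoise_step_shape`, and `decode_shape` are hypothetical names, the pixel tensor is assumed to have 3 channels, and the VAE is assumed to compress only height and width.

```python
from typing import NamedTuple


class VideoShape(NamedTuple):
    """Named axes for a rank-5 video tensor (hypothetical helper)."""
    batch: int
    channels: int
    frames: int
    height: int
    width: int


def encode_shape(pixels: VideoShape, latent_channels: int, scale: int) -> VideoShape:
    """Shape after an assumed VAE encode: channels change, height/width shrink."""
    if pixels.height % scale or pixels.width % scale:
        raise ValueError("height and width must be divisible by scale")
    return pixels._replace(
        channels=latent_channels,
        height=pixels.height // scale,
        width=pixels.width // scale,
    )


def denoise_step_shape(latent: VideoShape) -> VideoShape:
    """A denoiser/scheduler step updates values; the shape passes through."""
    return latent


def decode_shape(latent: VideoShape, pixel_channels: int, scale: int) -> VideoShape:
    """Shape after an assumed VAE decode: the inverse of encode_shape."""
    return latent._replace(
        channels=pixel_channels,
        height=latent.height * scale,
        width=latent.width * scale,
    )


# Assumed example sizes, chosen only for illustration.
pixels = VideoShape(batch=1, channels=3, frames=8, height=256, width=256)
latent = encode_shape(pixels, latent_channels=16, scale=8)

for _ in range(4):  # a few sampling steps
    latent = denoise_step_shape(latent)

decoded = decode_shape(latent, pixel_channels=3, scale=8)
print(latent)
print(decoded == pixels)
```

Naming the axes makes the invariant easy to state: the frame count is carried through every stage, and only the encode and decode boundaries change channels and spatial size in this sketch.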
Starter preview
Excerpt from code/lab-03-latent-shape-tracking/latent_shapes.py. The linked starter file is the source of truth.
# Latent Shape Tracking
# Annotated reading material. Running this file is optional.
# Source-of-truth focus: Track latent shape through batch, channel, frame, height and width axes.
batch, frames, height, width = 1, 16, 720, 1280
latent_channels = 16
scale = 8
video_latent_shape = (batch, latent_channels, frames, height // scale, width // scale)
# What to explain while reading:
# - frames is a semantic axis, not batch.
# - scale_factor shrinks spatial dimensions.
# - latent_shape is what denoiser/scheduler update.
#
# Common traps:
# - Do not confuse batch with frames.
# - Do not assume image and video latents have the same rank.
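Evaluating the starter values by hand is a useful check. The sketch below reuses the starter variable names; `image_latent_shape` and `pixel_shape` are additions made for comparison, and the 3-channel pixel tensor is an assumption rather than something the starter file states.

```python
from math import prod

# Values copied from the starter excerpt.
batch, frames, height, width = 1, 16, 720, 1280
latent_channels = 16
scale = 8

video_latent_shape = (batch, latent_channels, frames, height // scale, width // scale)

# Assumption for comparison: an image latent has no frames axis (rank 4).
image_latent_shape = (batch, latent_channels, height // scale, width // scale)

# Assumption for comparison: RGB pixels, so 3 channels.
pixel_shape = (batch, 3, frames, height, width)

print(video_latent_shape, "rank", len(video_latent_shape))
print(image_latent_shape, "rank", len(image_latent_shape))
print("pixel elements:", prod(pixel_shape))
print("latent elements:", prod(video_latent_shape))
print("ratio:", prod(pixel_shape) / prod(video_latent_shape))
```

With these numbers, 720 // 8 is 90 and 1280 // 8 is 160, so the video latent is (1, 16, 16, 90, 160). It holds 12 times fewer elements than the assumed pixel tensor, not 64 times fewer, because the channel count grows from 3 to 16 while each spatial axis shrinks by 8.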
What each block is doing
- Setup / contract: `frames` is its own semantic (time) axis, not part of the batch axis.
- Main transition: `scale_factor` shrinks the spatial dimensions (height and width).
- Interview hook: `latent_shape` is the shape of the tensor that the denoiser and scheduler update.
Reading checkpoints
- Video latents add a time/frame dimension.
- Flattening axes in code does not remove their meaning.
- Shape tracking is often the fastest way to debug pipeline wiring.
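The second checkpoint can be illustrated with shape arithmetic alone. The sketch below assumes a common pattern in which frames are folded into the batch axis so that a per-frame 2D layer can run, then unfolded afterwards. The helper names are hypothetical, and the index math assumes the tensor was permuted to (batch, frames, channels, height, width) before the reshape, so that frames of the same sample are adjacent.

```python
def fold_frames_into_batch(shape):
    """(B, C, F, H, W) -> ((B*F, C, H, W), F). Frames must be remembered."""
    b, c, f, h, w = shape
    return (b * f, c, h, w), f


def unfold_frames_from_batch(shape, frames):
    """(B*F, C, H, W) -> (B, C, F, H, W), using the remembered frame count."""
    bf, c, h, w = shape
    if bf % frames:
        raise ValueError(f"leading axis {bf} is not a multiple of frames={frames}")
    return (bf // frames, c, frames, h, w)


def split_folded_index(index, frames):
    """Recover (batch_index, frame_index) from a position on the folded axis."""
    return divmod(index, frames)


video_latent_shape = (2, 16, 16, 90, 160)

folded, frames = fold_frames_into_batch(video_latent_shape)
print(folded)                          # leading axis is now batch * frames
print(split_folded_index(17, frames))  # which sample and frame row 17 is
print(unfold_frames_from_batch(folded, frames))
```

The folded shape looks like a batch of 32 images, but row 17 is still frame 1 of sample 1. Losing the remembered `frames` value is what makes the fold irreversible, which is why the axis keeps its meaning even when the code flattens it.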
What this lab prevents
- Do not confuse batch with frames.
- Do not assume image and video latents have the same rank.
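Both traps can be caught early by labeling axes at pipeline boundaries. The following is a minimal sketch of that idea; `VIDEO_AXES`, `IMAGE_AXES`, and `label_axes` are hypothetical names, and the axis order is the one used in the starter excerpt, which a specific model family may not share.

```python
VIDEO_AXES = ("batch", "channels", "frames", "height", "width")
IMAGE_AXES = ("batch", "channels", "height", "width")


def label_axes(shape, axes):
    """Pair each dimension with its axis name, failing loudly on a rank mismatch."""
    if len(shape) != len(axes):
        raise ValueError(
            f"expected rank {len(axes)} {axes}, got rank {len(shape)}: {shape}"
        )
    return dict(zip(axes, shape))


video_latent_shape = (1, 16, 16, 90, 160)

labeled = label_axes(video_latent_shape, VIDEO_AXES)
print(labeled)

# Treating a video latent as an image latent is reported instead of silently passing.
try:
    label_axes(video_latent_shape, IMAGE_AXES)
except ValueError as err:
    print("rank mismatch:", err)
```

In the starter example the channel and frame counts are both 16, so a swapped axis order would not change the printed tuple. Reading dimensions by name rather than by position is what makes that kind of mix-up visible.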
How to say it out loud
Track latent shape through batch, channel, frame, height and width axes. Then explain the code by naming the state being transformed, the axis or shape that matters, and the tradeoff that would appear in a real system.
Additional intuition
- Use official docs and papers for API behavior and factual claims; use blogs only to improve the mental picture.
- If support matrices, performance behavior or backend choices are version-sensitive, check current docs before repeating them.
- A strong interview answer names the state object, the shape or axis it changes, and the tradeoff it creates.
