Lab 03 - Latent Shape Tracking

Overview

Annotated code reading lab. Running code is optional.

Related handbook section

Latent Shape Tracking

This lab maps directly to the handbook section. Read the related handbook section first, then use the lab page and starter file to connect the concept to concrete variables, shapes, APIs, and interview-ready explanations.

Concept Goal

Latent Shape Tracking

Track the latent shape through the batch, channel, frame, height, and width axes.

Mental Model

Mechanism to keep in mind

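`frames` is a semantic axis, not batch: it indexes time within one video, while batch indexes independent samples.
`scale_factor` (named `scale` in the starter excerpt) shrinks the spatial dimensions: height and width are each divided by it.
`latent_shape` is the shape of the tensor that the denoiser and scheduler update.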

image/video input or noise
  -> VAE latent space
  -> latent tensor keeps batch, channel, frame/time, height, width axes
  -> denoiser and scheduler update latent values
  -> VAE decode maps latent back to pixels/frames

Implementation-dependent: the exact video latent rank and channel count vary by model family. Use this as a shape-tracking mental model, and verify against the current DiffSynth source for any specific pipeline.
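
The flow above can be traced with plain shape arithmetic. The following is a minimal sketch, not DiffSynth code: the function names `encode_shape`, `denoise_step_shape`, and `decode_shape` are hypothetical, and it assumes a channel-first RGB layout, spatial-only compression (no temporal compression), and height and width divisible by the scale.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def encode_shape(pixel_shape: Shape, latent_channels: int, scale: int) -> Shape:
    """Hypothetical VAE encode: (B, 3, T, H, W) -> (B, C_lat, T, H/scale, W/scale)."""
    batch, _rgb, frames, height, width = pixel_shape
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    return (batch, latent_channels, frames, height // scale, width // scale)


def denoise_step_shape(latent_shape: Shape) -> Shape:
    """A denoiser/scheduler step changes latent values, not the latent shape."""
    return latent_shape


def decode_shape(latent_shape: Shape, scale: int, rgb_channels: int = 3) -> Shape:
    """Hypothetical VAE decode: maps the latent shape back to pixel space."""
    batch, _channels, frames, lat_h, lat_w = latent_shape
    return (batch, rgb_channels, frames, lat_h * scale, lat_w * scale)


pixel_shape = (1, 3, 16, 720, 1280)  # assumed layout: (batch, rgb, frames, height, width)
latent = encode_shape(pixel_shape, latent_channels=16, scale=8)
for _ in range(4):  # a few illustrative denoising steps
    latent = denoise_step_shape(latent)
decoded = decode_shape(latent, scale=8)

print("latent :", latent)
print("decoded:", decoded)
```

With these assumed numbers the latent shape is (1, 16, 16, 90, 160) at every denoising step, and decoding returns the original pixel shape.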

Annotated Code Preview

Starter preview

Excerpt from code/lab-03-latent-shape-tracking/latent_shapes.py. The linked starter file is the source of truth.

# Latent Shape Tracking
# Annotated reading material. Running this file is optional.
# Source-of-truth focus: Track latent shape through batch, channel, frame, height and width axes.

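# Pixel-space input: 1 video, 16 frames, 720x1280 resolution.
batch, frames, height, width = 1, 16, 720, 1280
latent_channels = 16
scale = 8  # spatial downsampling factor (the `scale_factor` in the notes below)
# Layout: (batch, channels, frames, latent_height, latent_width)
video_latent_shape = (batch, latent_channels, frames, height // scale, width // scale)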

# What to explain while reading:
# - frames is a semantic axis, not batch.
# - scale_factor shrinks spatial dimensions.
# - latent_shape is what denoiser/scheduler update.
#
# Common traps:
# - Do not confuse batch with frames.
# - Do not assume image and video latents have the same rank.
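
To make the excerpt's result concrete, the sketch below repeats its assignments and prints the latent shape, its rank, and its element count. The `pixel_shape` tuple and the element-count comparison are additions for illustration; they assume a channel-first RGB input, which the starter file does not state.

```python
from math import prod

# Same values as the starter excerpt.
batch, frames, height, width = 1, 16, 720, 1280
latent_channels = 16
scale = 8

# Assumption for illustration: RGB input in (batch, rgb, frames, height, width) layout.
pixel_shape = (batch, 3, frames, height, width)
video_latent_shape = (batch, latent_channels, frames, height // scale, width // scale)

pixel_elements = prod(pixel_shape)
latent_elements = prod(video_latent_shape)

print("latent shape:", video_latent_shape)
print("rank:", len(video_latent_shape))
print("latent elements:", latent_elements)
print("pixel/latent element ratio:", pixel_elements / latent_elements)
```

Under these assumptions the latent has rank 5 and shape (1, 16, 16, 90, 160), with 12 times fewer elements than the pixel tensor: each spatial axis shrinks by 8, while the channel count grows from 3 to 16.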

Line-by-line Explanation

What each block is doing

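Setup / contract: the first assignment fixes `batch`, `frames`, `height`, and `width`; `frames` is a semantic axis, not batch.
Main transition: `height // scale` and `width // scale` apply the `scale_factor` and shrink the spatial dimensions.
Interview hook: `video_latent_shape` is the shape of the tensor that the denoiser and scheduler update.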

What to Notice

Reading checkpoints

Video latents add a time/frame dimension.
Flattening axes in code does not remove their meaning.
Shape tracking is often the fastest way to debug pipeline wiring.
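
The second checkpoint can be illustrated with shape arithmetic alone. This is a hedged sketch with hypothetical helper names: it assumes the frame axis has already been moved next to the batch axis (a permute in real tensor code) before the two are merged, so that each merged index still maps back to a (sample, frame) pair.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def fold_frames_into_batch(shape: Shape) -> Shape:
    """(B, C, T, H, W) -> (B*T, C, H, W), e.g. to reuse a per-image 2D module."""
    batch, channels, frames, height, width = shape
    return (batch * frames, channels, height, width)


def unfold_frames(shape: Shape, frames: int) -> Shape:
    """(B*T, C, H, W) -> (B, C, T, H, W); requires knowing the frame count."""
    merged, channels, height, width = shape
    if merged % frames:
        raise ValueError("merged batch axis is not a multiple of frames")
    return (merged // frames, channels, frames, height, width)


def merged_index_to_sample_frame(index: int, frames: int) -> Tuple[int, int]:
    """Recover (sample, frame) from a merged index, assuming sample-major order."""
    return divmod(index, frames)


video_latent_shape = (2, 16, 16, 90, 160)
folded = fold_frames_into_batch(video_latent_shape)
restored = unfold_frames(folded, frames=16)

print("folded  :", folded)
print("restored:", restored)
print("merged index 17 ->", merged_index_to_sample_frame(17, frames=16))
```

The folded shape has rank 4, yet row 17 of the merged axis is still frame 1 of sample 1. The axis meaning survives the reshape, but only if the code remembers the frame count.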

Common Misunderstandings

What this lab prevents

Do not confuse batch with frames.
Do not assume image and video latents have the same rank.
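
Both traps can be guarded against by naming the axes explicitly instead of indexing by position. The helper below is a hypothetical sketch: it assumes image latents use a rank-4 (batch, channels, height, width) layout and video latents a rank-5 (batch, channels, frames, height, width) layout, which may not hold for every model family.

```python
from typing import Dict, Tuple


def describe_latent(shape: Tuple[int, ...]) -> Dict[str, int]:
    """Name the axes of a latent shape under the assumed channel-first layouts."""
    if len(shape) == 4:  # assumed image layout
        names = ("batch", "channels", "height", "width")
    elif len(shape) == 5:  # assumed video layout
        names = ("batch", "channels", "frames", "height", "width")
    else:
        raise ValueError(f"unexpected latent rank {len(shape)}: {shape}")
    return dict(zip(names, shape))


image_axes = describe_latent((1, 16, 90, 160))
video_axes = describe_latent((1, 16, 16, 90, 160))

print("image:", image_axes)
print("video:", video_axes)
# An image latent has no "frames" key, so code that needs frames fails loudly.
print("image has frames axis:", "frames" in image_axes)
```

Passing a 16-frame video latent where an image latent is expected now raises or produces a visibly different set of axis names, rather than silently treating frames as batch entries.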

Interview Explanation

How to say it out loud

Track the latent shape through the batch, channel, frame, height, and width axes. Then explain the code by naming the state being transformed, the axis or shape that matters, and the tradeoff that would appear in a real system.

External intuition notes

Additional intuition

Use official docs and papers for API behavior and factual claims; use blogs only to improve the mental picture.
If support matrices, performance behavior or backend choices are version-sensitive, check current docs before repeating them.
A strong interview answer names the state object, the shape or axis it changes, and the tradeoff it creates.

InfraLens