InfraLens

A clear starting point for learning AI infrastructure.

Lab 09: Nsight Profiling Workflow

This is an annotated code-reading lab; running the code is optional.

Concept Goal

Read code to understand the concept

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
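One way to make the bandwidth-bound versus compute-bound question concrete is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the hardware ridge point (peak FLOP/s divided by peak bytes/s). The sketch below is illustrative only. The function names and the peak numbers are hypothetical placeholders, not figures from this lab or from any specific GPU, and the byte counts assume ideal reuse, where each operand crosses DRAM once.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from DRAM."""
    return flops / bytes_moved


def classify(flops: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> str:
    """Roofline-style label: below the ridge point, memory traffic is the limit."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two roofs meet
    ai = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if ai >= ridge else "bandwidth-bound"


# Hypothetical hardware peaks, chosen only for illustration.
PEAK_FLOPS = 100e12      # 100 TFLOP/s (assumption)
PEAK_BANDWIDTH = 1e12    # 1 TB/s (assumption)

# Elementwise add of n fp32 values: n FLOPs, two reads and one write of 4 bytes each.
n = 1 << 20
add_label = classify(n, 3 * 4 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# Square fp16 matmul of size N: 2*N^3 FLOPs; ideal traffic is three N*N matrices of 2 bytes.
N = 4096
matmul_label = classify(2 * N**3, 3 * 2 * N * N, PEAK_FLOPS, PEAK_BANDWIDTH)

print(add_label, matmul_label)
```

Under these assumed peaks the ridge point is 100 FLOP/byte, so the elementwise add (1/12 FLOP/byte) sits far on the memory side while the large matmul (N/3 FLOP/byte) sits on the compute side. Measured profiler counters, not this estimate, should settle the question for a real kernel.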

Mental Model

Core mechanism

  • Nsight Systems answers where time goes across CPU, CUDA kernels, memory copies and NCCL.
  • Nsight Compute answers why one kernel behaves the way it does.
  • PyTorch Profiler helps map framework-level operations to lower-level GPU work.
  • The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and KV Cache so the scaling bottleneck can be named precisely.
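The memory ledger in the last bullet can be sketched as a small record with one field per category. This is a minimal illustration, not the lab's implementation: the class and function names are hypothetical, and the per-parameter byte counts assume mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights, momentum, and variance). Activations, buffers, buckets, and KV cache depend on the workload, so they are plain inputs here.

```python
from dataclasses import dataclass


@dataclass
class MemoryLedger:
    """Per-GPU memory estimate in bytes, one field per ledger category."""
    weights: int
    gradients: int
    optimizer_states: int
    activations: int = 0
    temporary_buffers: int = 0
    communication_buckets: int = 0
    kv_cache: int = 0

    def total(self) -> int:
        return sum(vars(self).values())

    def largest(self) -> str:
        """Name of the category that dominates the ledger."""
        return max(vars(self), key=lambda name: getattr(self, name))


def mixed_precision_adam_ledger(num_params: int, activations: int = 0) -> MemoryLedger:
    # Assumption: fp16 weights and gradients (2 bytes each) plus fp32 master
    # weights, momentum, and variance for Adam (4 + 4 + 4 bytes per parameter).
    return MemoryLedger(
        weights=2 * num_params,
        gradients=2 * num_params,
        optimizer_states=12 * num_params,
        activations=activations,
    )


ledger = mixed_precision_adam_ledger(num_params=1_000_000_000)
print(ledger.largest(), ledger.total() / 1e9, "GB")
```

Naming the dominant category is the point of the ledger: it tells you which part of the memory budget to attack first.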

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-09-nsight-profiling/profile_commands.sh. This preview explains the key idea; the linked starter file is the source of truth.

torchrun --nproc-per-node=${NPROC} ${TRAIN_SCRIPT} --config ${CONFIG}

nsys profile \
  --trace=cuda,nvtx,osrt \
  --output=reports/nsys_baseline \
  python ${TRAIN_SCRIPT} --config ${CONFIG}

ncu --set full \
  --target-processes all \
  python ${KERNEL_BENCH}

Line-by-line Explanation

Key code blocks

torchrun
Launches the distributed workload; use it as the entry point when profiling DDP/FSDP behavior.
nsys profile
Collects timeline-level evidence: CPU gaps, CUDA kernels, host-to-device and device-to-host copies, and NCCL activity.
--trace=cuda,nvtx,osrt
Selects the activity domains to record: CUDA API calls and GPU work, NVTX ranges, and OS runtime calls.
ncu --set full
Collects the full set of metric sections for each kernel, exposing counters such as memory throughput, occupancy, and stall reasons.
one-change checklist (not shown in the excerpt above)
Forces causal reasoning: change one variable per run instead of many at once.
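The one-change rule can be enforced mechanically before two profiles are compared. The sketch below is a hypothetical helper, not part of the starter file: it diffs two run configurations and refuses the comparison unless exactly one setting differs. The configuration keys are illustrative only.

```python
_MISSING = object()  # sentinel so a missing key differs from a key set to None


def changed_keys(baseline: dict, candidate: dict) -> list:
    """Return the sorted config keys whose values differ between two runs."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def assert_one_change(baseline: dict, candidate: dict) -> str:
    """Raise unless exactly one setting differs; return the name of that setting."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one change, found {len(diff)}: {diff}")
    return diff[0]


# Hypothetical run configurations; the keys are illustrative only.
baseline = {"batch_size": 32, "num_workers": 4, "precision": "fp16"}
candidate = {"batch_size": 32, "num_workers": 8, "precision": "fp16"}
print(assert_one_change(baseline, candidate))
```

If the check passes, any difference between the two profiles can be attributed to the single changed setting, subject to run-to-run noise.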

What to Notice

How to read this code

  • Nsight Systems and Nsight Compute answer different questions: where the time goes versus why one kernel behaves the way it does.
  • A profile screenshot is not a conclusion; the conclusion is the bottleneck explanation.
  • Roofline intuition helps avoid optimizing FLOPs on a memory-bound path.
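Turning a profile into a named bottleneck can be sketched as a small decision rule over two measured numbers: the kernel's compute throughput and memory throughput, each as a percentage of peak (Nsight Compute reports values of this kind). The function name and the 60% threshold below are assumptions chosen for illustration, not a rule taken from this lab or from the profiler.

```python
def name_bottleneck(compute_pct: float, memory_pct: float,
                    busy_threshold: float = 60.0) -> str:
    """Label a kernel from its measured percent-of-peak throughputs.

    busy_threshold is an illustrative assumption: below it, neither
    resource is considered close to saturated.
    """
    if max(compute_pct, memory_pct) < busy_threshold:
        return "latency-bound"
    if memory_pct >= compute_pct:
        return "bandwidth-bound"
    return "compute-bound"


# Hypothetical measurements for three kernels.
print(name_bottleneck(compute_pct=25.0, memory_pct=88.0))
print(name_bottleneck(compute_pct=91.0, memory_pct=40.0))
print(name_bottleneck(compute_pct=20.0, memory_pct=30.0))
```

The label is a hypothesis to test, not a verdict: the explanation still has to account for why the kernel sits where it does.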

Common Misunderstandings

What this code does not mean

  • “Profiler output automatically tells you the fix.” It gives evidence; you still need a hypothesis.
  • “One run is enough.” Hold the other variables fixed, and repeat the comparison whenever run-to-run noise is comparable to the effect you are measuring.
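A rough way to decide whether a measured speedup is larger than run-to-run noise is to repeat each configuration and compare the gap between means with the spread of the samples. The sketch below uses only the Python standard library; the function names, the factor `k`, and the sample timings are hypothetical, and the rule is a heuristic rather than a formal significance test.

```python
import statistics


def summarize(samples):
    """Mean and sample standard deviation of repeated step times."""
    return statistics.mean(samples), statistics.stdev(samples)


def difference_exceeds_noise(baseline, candidate, k=2.0):
    """True if the gap between means is larger than k times the noisier run's spread.

    A rough heuristic, not a formal significance test.
    """
    base_mean, base_std = summarize(baseline)
    cand_mean, cand_std = summarize(candidate)
    return abs(base_mean - cand_mean) > k * max(base_std, cand_std)


# Hypothetical step times in milliseconds from repeated runs.
baseline_ms = [100, 102, 98, 101, 99]
faster_ms = [90, 91, 89, 92, 88]
noisy_ms = [99, 103, 97, 102, 100]

print(difference_exceeds_noise(baseline_ms, faster_ms))  # gap of 10 ms vs ~1.6 ms spread
print(difference_exceeds_noise(baseline_ms, noisy_ms))   # gap of 0.2 ms is within noise
```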

Interview Explanation

How to say it out loud

I first use PyTorch Profiler to find hot framework ops, then Nsight Systems to see the end-to-end timeline and communication gaps, and finally Nsight Compute to inspect the hot kernels. I classify the bottleneck from that evidence, change one variable, and remeasure.
