Lab 09: Nsight Profiling Workflow
Annotated code reading lab. Running code is optional.
Profiling
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Read code to understand the concept
Core mechanism
- Nsight Systems answers where time goes across CPU, CUDA kernels, memory copies and NCCL.
- Nsight Compute answers why one kernel behaves the way it does.
- PyTorch Profiler helps map framework-level operations to lower-level GPU work.
- The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and the KV cache so the scaling bottleneck can be named precisely.
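As a rough illustration of the ledger idea, the sketch below tallies only the per-parameter model states. The byte counts are assumptions, not something this lab specifies: 2 bytes each for half-precision weights and gradients, and 12 bytes of optimizer state per parameter (an fp32 master copy plus two Adam moments). Activations, temporary buffers, communication buckets and the KV cache depend on batch size and sequence length, so they are left out. The names LedgerEntry, model_state_ledger, total_bytes and largest_entry are hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class LedgerEntry:
    name: str
    num_bytes: int


def model_state_ledger(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Per-parameter model states only (assumed mixed-precision Adam byte counts)."""
    return [
        LedgerEntry("weights", num_params * weight_bytes),
        LedgerEntry("gradients", num_params * grad_bytes),
        LedgerEntry("optimizer_states", num_params * optimizer_bytes),
    ]


def total_bytes(ledger):
    return sum(entry.num_bytes for entry in ledger)


def largest_entry(ledger):
    """Return the entry that dominates the ledger, i.e. the one to name first."""
    return max(ledger, key=lambda entry: entry.num_bytes)


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model.
    ledger = model_state_ledger(7_000_000_000)
    for entry in ledger:
        print(f"{entry.name:>17}: {entry.num_bytes / GIB:8.1f} GiB")
    print(f"{'total':>17}: {total_bytes(ledger) / GIB:8.1f} GiB")
    print("largest:", largest_entry(ledger).name)
```

Keeping each category as its own entry is the point: the largest line item, not the total, tells you which technique (sharding, recomputation, a smaller batch) is worth trying first.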
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-09-nsight-profiling/profile_commands.sh. This preview explains the key idea; the linked starter file is the source of truth.
torchrun --nproc-per-node=${NPROC} ${TRAIN_SCRIPT} --config ${CONFIG}
nsys profile --trace=cuda,nvtx,osrt --output=reports/nsys_baseline python ${TRAIN_SCRIPT} --config ${CONFIG}
ncu --set full --target-processes all python ${KERNEL_BENCH}
Key code blocks
- torchrun: Represents a distributed workload entry point; useful when profiling DDP/FSDP behavior.
- nsys profile: Collects timeline-level evidence: CPU gaps, CUDA kernels, H2D/D2H and NCCL.
- --trace=cuda,nvtx,osrt: Selects the activity domains that make the timeline meaningful.
- ncu --set full: Drills into kernel counters such as memory throughput, occupancy and stalls.
- one-change checklist: Forces causal reasoning instead of changing many variables at once.
How to read this code
- Nsight Systems and Nsight Compute answer different questions: where time goes across the whole run versus why one kernel behaves the way it does.
- A profile screenshot is not a conclusion; the conclusion is the bottleneck explanation.
- Roofline intuition helps avoid optimizing FLOPs on a memory-bound path.
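The roofline point can be made concrete with a little arithmetic. The sketch below compares a kernel's arithmetic intensity (FLOPs per byte moved) with the hardware ridge point (peak FLOP/s divided by peak bandwidth). The peak numbers are hypothetical round figures rather than the specs of any real GPU, and the byte counts assume ideal traffic with no cache reuse or redundant loads.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def ridge_point(peak_flops, peak_bandwidth):
    """Intensity above which a kernel can become compute-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline ceiling: the lower of the compute roof and the bandwidth roof."""
    intensity = arithmetic_intensity(flops, bytes_moved)
    return min(peak_flops, intensity * peak_bandwidth)


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    intensity = arithmetic_intensity(flops, bytes_moved)
    if intensity >= ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
    PEAK_FLOPS, PEAK_BW = 100e12, 1e12

    # fp32 vector add: 1 FLOP per element, 12 bytes (two reads, one write).
    n = 1 << 24
    print("vector add:", classify(n, 12 * n, PEAK_FLOPS, PEAK_BW))

    # fp32 square matmul with ideal traffic: 2*m^3 FLOPs, 3 matrices of m*m*4 bytes.
    m = 4096
    print("matmul:", classify(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW))
```

If a kernel lands on the memory-bound side, reducing its FLOP count will not speed it up; reducing or fusing its memory traffic might.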
What this code does not mean
- “Profiler output automatically tells the fix.” It gives evidence; you still need a hypothesis.
- “One run is enough.” Profiling needs fixed variables and repeated comparison when noise matters.
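One way to make "repeated comparison" operational is to compare medians and only accept a change whose gain exceeds the run-to-run spread. The sketch below is a crude heuristic under that assumption, not a statistical test, and the timing values are invented for illustration. The function names are hypothetical.

```python
import statistics


def summarize(samples_ms):
    """Median and min-to-max spread of repeated step-time measurements."""
    return {
        "median": statistics.median(samples_ms),
        "spread": max(samples_ms) - min(samples_ms),
    }


def is_real_improvement(baseline_ms, candidate_ms):
    """Accept the change only if the median gain is larger than the noise."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    gain = base["median"] - cand["median"]
    noise = max(base["spread"], cand["spread"])
    return gain > noise


if __name__ == "__main__":
    # Invented step times in milliseconds, same config except one change.
    baseline = [100.0, 101.0, 99.0, 100.0, 102.0]
    faster = [90.0, 91.0, 89.0, 90.0, 92.0]
    within_noise = [99.0, 100.0, 98.0, 101.0, 99.0]

    print("faster:", is_real_improvement(baseline, faster))
    print("within noise:", is_real_improvement(baseline, within_noise))
```

The comparison is only meaningful when everything except the one change under test (batch size, input data, clocks, warm-up steps) is held fixed across both sets of runs.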
How to say it out loud
I first use PyTorch Profiler to find hot framework ops, Nsight Systems to see the end-to-end timeline and communication gaps, then Nsight Compute to inspect hot kernels. I classify the bottleneck using evidence and change one variable before remeasuring.
Additional intuition
- Nsight Systems should be read as timeline evidence: CPU gaps, CUDA launches, NCCL work and idle regions appear before kernel-level diagnosis. Official: Nsight Systems documentation
- Nsight Compute is the kernel microscope, so use it after you know which kernel or section is worth inspecting. Official: Nsight Compute documentation
- NVIDIA profiling blog material is useful for the workflow idea: start broad, drill down, then verify one hypothesis with a controlled change. Blog: NVIDIA profiling and optimizing deep neural networks
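Reading a timeline for idle regions amounts to interval arithmetic: merge the busy intervals, then measure what is left of the window. The sketch below assumes you already have (start, end) times of GPU activity, for example copied by hand from a timeline view; it does not parse any Nsight report format, and the sample intervals are invented. It also assumes every interval lies inside the window.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def idle_gaps(intervals, window_start, window_end):
    """Return the (start, end) regions of the window with no GPU activity."""
    gaps = []
    cursor = window_start
    for start, end in merge_intervals(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


def idle_fraction(intervals, window_start, window_end):
    """Share of the window during which the GPU was idle."""
    idle = sum(end - start for start, end in idle_gaps(intervals, window_start, window_end))
    return idle / (window_end - window_start)


if __name__ == "__main__":
    # Invented busy intervals in milliseconds over a 10 ms window.
    busy = [(0, 2), (1, 3), (5, 6)]
    print("gaps:", idle_gaps(busy, 0, 10))
    print("idle fraction:", idle_fraction(busy, 0, 10))
```

A large idle fraction points at the host side (data loading, launch overhead, synchronization) and suggests staying with the timeline view a little longer before drilling into any single kernel.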
