Lab 09 - Nsight Profiling Workflow

Overview

This is an annotated code-reading lab. Running the code is optional.

Related handbook section

Profiling

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

Concept Goal

Read the profiling commands to understand which tool answers which performance question.

Mental Model

Core mechanism

Nsight Systems answers where time goes across CPU work, CUDA kernels, memory copies, and NCCL communication.
Nsight Compute answers why one kernel behaves the way it does.
PyTorch Profiler helps map framework-level operations to lower-level GPU work.
The memory ledger separates weights, gradients, optimizer states, activations, temporary buffers, communication buckets, and the KV cache so the scaling bottleneck can be named precisely.
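The memory ledger idea can be sketched as a small table of bytes per category. This is a minimal illustration, not part of the lab's starter files: the function names, the 1-billion-parameter model, and the 6 GiB activation figure are all hypothetical, and the per-parameter byte counts assume plain fp32 training with an Adam-style optimizer that keeps two fp32 moments per parameter.

```python
# Hypothetical memory ledger. Every size below is an illustrative assumption.
GIB = 1024 ** 3


def build_ledger(num_params, activation_bytes, temp_bytes=0,
                 comm_bucket_bytes=0, kv_cache_bytes=0):
    """Return bytes per category, assuming fp32 training with Adam-style states."""
    return {
        "weights": 4 * num_params,           # one fp32 value per parameter
        "gradients": 4 * num_params,         # one fp32 gradient per parameter
        "optimizer_states": 8 * num_params,  # two fp32 moments per parameter
        "activations": activation_bytes,
        "temporary_buffers": temp_bytes,
        "communication_buckets": comm_bucket_bytes,
        "kv_cache": kv_cache_bytes,
    }


def largest_entry(ledger):
    """Return the (category, bytes) pair that dominates the ledger."""
    return max(ledger.items(), key=lambda item: item[1])


ledger = build_ledger(num_params=1_000_000_000, activation_bytes=6 * GIB)
for name, size in sorted(ledger.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>22}: {size / GIB:6.2f} GiB")
print("dominant category:", largest_entry(ledger)[0])
```

Naming the dominant category first makes it clearer which knob (sharding, activation checkpointing, smaller buckets) is even worth testing.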

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README, profile_commands.sh, profiling_checklist.md

Annotated Code Preview

Starter Preview

Excerpt from code/lab-09-nsight-profiling/profile_commands.sh. This preview explains the key idea; the linked starter file is the source of truth.

torchrun --nproc-per-node=${NPROC} ${TRAIN_SCRIPT} --config ${CONFIG}

nsys profile \
  --trace=cuda,nvtx,osrt \
  --output=reports/nsys_baseline \
  python ${TRAIN_SCRIPT} --config ${CONFIG}

ncu --set full \
  --target-processes all \
  python ${KERNEL_BENCH}

Line-by-line Explanation

Key code blocks

torchrun: Represents a distributed workload entry point; useful when profiling DDP/FSDP behavior.
nsys profile: Collects timeline-level evidence: CPU gaps, CUDA kernels, H2D/D2H copies, and NCCL activity.
--trace=cuda,nvtx,osrt: Selects the activity domains recorded on the timeline: CUDA API calls and GPU work, NVTX ranges, and OS runtime library calls.
ncu --set full: Drills into kernel counters such as memory throughput, occupancy, and stalls.
one-change checklist: Forces causal reasoning by changing one variable per run instead of many at once.
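The one-change rule can be enforced mechanically before a run is compared against the baseline. The sketch below is an assumption about how such a check might look, not the contents of profiling_checklist.md; the function names and the setting names (`batch_size`, `num_workers`, `amp`) are hypothetical.

```python
def changed_keys(baseline, candidate):
    """Return the sorted setting names whose values differ between two runs."""
    missing = object()  # sentinel so a key absent on one side counts as a change
    keys = set(baseline) | set(candidate)
    return sorted(
        key for key in keys
        if baseline.get(key, missing) != candidate.get(key, missing)
    )


def assert_one_change(baseline, candidate):
    """Raise ValueError unless exactly one setting differs; return its name."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(
            f"expected exactly one changed setting, found {len(diff)}: {diff}"
        )
    return diff[0]


# Hypothetical run settings used only for illustration.
baseline = {"batch_size": 32, "num_workers": 4, "amp": False}
candidate = {**baseline, "amp": True}
two_changes = {**baseline, "amp": True, "batch_size": 64}

print("variable under test:", assert_one_change(baseline, candidate))
print("settings changed in the invalid comparison:", changed_keys(baseline, two_changes))
```

If two settings changed, any speedup cannot be attributed to either one, which is the situation the checklist is meant to prevent.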

What to Notice

How to read this code

Nsight Systems and Nsight Compute answer different questions: where time goes across the whole run versus why one kernel behaves the way it does.
A profile screenshot is not a conclusion; the conclusion is the bottleneck explanation.
Roofline intuition helps avoid optimizing FLOPs on a memory-bound path.
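The roofline intuition above can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the peak throughput and bandwidth values describe a hypothetical GPU rather than a specific product, and the byte counts assume ideal traffic in which each operand is read or written exactly once.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bandwidth


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (intensity, label, attainable FLOP/s) under a simple roofline model."""
    intensity = flops / bytes_moved
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        label = "memory-bound"
    else:
        label = "compute-bound"
    attainable = min(peak_flops, intensity * peak_bandwidth)
    return intensity, label, attainable


# Hypothetical hardware: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

# Elementwise fp32 add over n values: 1 FLOP per element,
# 12 bytes per element (two 4-byte reads, one 4-byte write).
n = 1 << 20
add_result = classify(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# Square fp32 matmul of size m: 2*m^3 FLOPs, three m*m matrices of 4-byte values.
m = 4096
matmul_result = classify(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW)

for name, (intensity, label, attainable) in [("add", add_result),
                                             ("matmul", matmul_result)]:
    print(f"{name}: {intensity:.3f} FLOP/byte, {label}, "
          f"attainable {attainable / 1e12:.2f} TFLOP/s")
```

Under these assumptions the elementwise add sits far below the ridge point, so reducing its FLOPs would not help; reducing bytes moved (for example by fusing it into a neighboring kernel) would.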

Common Misunderstandings

What this code does not mean

“Profiler output automatically tells the fix.” It gives evidence; you still need a hypothesis.
“One run is enough.” Profiling needs fixed variables and repeated comparison when noise matters.
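The point about repeated comparison can be illustrated with a small noise check. The sketch below is a rough heuristic, not a formal statistical test: the step-time samples are made up, and the rule that a gain must exceed twice the larger run-to-run standard deviation is an assumption chosen for illustration.

```python
import statistics


def summarize(samples_ms):
    """Return (median, sample standard deviation) of step times in milliseconds."""
    return statistics.median(samples_ms), statistics.stdev(samples_ms)


def is_real_improvement(baseline_ms, candidate_ms, k=2.0):
    """Heuristic: the median gain must exceed k times the larger spread."""
    base_median, base_spread = summarize(baseline_ms)
    cand_median, cand_spread = summarize(candidate_ms)
    gain = base_median - cand_median
    return gain > k * max(base_spread, cand_spread)


# Made-up step times (ms) from five repeated runs of each configuration.
baseline = [102.0, 98.0, 101.0, 99.0, 100.0]
within_noise = [99.0, 101.0, 98.0, 100.0, 97.0]
clear_gain = [90.0, 91.0, 89.0, 92.0, 90.0]

print("within_noise is a real improvement:",
      is_real_improvement(baseline, within_noise))
print("clear_gain is a real improvement:",
      is_real_improvement(baseline, clear_gain))
```

A single run of the first candidate could easily have looked like a 3 ms win; only the repeated samples show that the difference sits inside the noise.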

Interview Explanation

How to say it out loud

I first use PyTorch Profiler to find hot framework ops, Nsight Systems to see the end-to-end timeline and communication gaps, then Nsight Compute to inspect hot kernels. I classify the bottleneck using evidence and change one variable before remeasuring.

External intuition notes

Additional intuition

Nsight Systems should be read as timeline evidence: CPU gaps, CUDA launches, NCCL work and idle regions appear before kernel-level diagnosis. Official: Nsight Systems documentation
Nsight Compute is the kernel microscope, so use it after you know which kernel or section is worth inspecting. Official: Nsight Compute documentation
NVIDIA profiling blog material is useful for the workflow idea: start broad, drill down, then verify one hypothesis with a controlled change. Blog: NVIDIA profiling and optimizing deep neural networks

InfraLens