# Profiling Checklist

## Configuration fields

Treat these fields as the variables that must be known before a profiling trace
can be explained. If you later run a real profile, record them for every run
before comparing traces; two traces that differ in an unrecorded field cannot be
compared reliably.

- Model:
- Script / commit:
- Hardware:
- Driver / CUDA:
- PyTorch:
- Precision:
- Batch size:
- Sequence length:
- Input/output length distribution:
- Warmup steps:
- Measurement steps:
- Random seed:
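
Some of these fields can be captured automatically so they are not forgotten. The following is a minimal sketch, not a prescribed tool: `collect_run_config` and `git_commit` are hypothetical helper names, the PyTorch fields are filled in only when PyTorch is importable, and everything the script cannot detect (model, precision, batch size, and so on) is passed in by hand.

```python
import platform
import subprocess
import sys


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_run_config(**manual_fields) -> dict:
    """Merge auto-detected fields with the ones only the operator knows."""
    config = {
        "commit": git_commit(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch  # optional: recorded only when PyTorch is installed

        config["pytorch"] = torch.__version__
        config["cuda"] = torch.version.cuda  # None on CPU-only builds
        if torch.cuda.is_available():
            config["hardware"] = torch.cuda.get_device_name(0)
    except ImportError:
        config["pytorch"] = "not installed"
    # Manual fields win, so the operator can override a wrong auto-detection.
    config.update(manual_fields)
    return config
```

A call such as `collect_run_config(model="my-model", precision="bf16", batch_size=8, seed=0)` (placeholder values) returns a dictionary that can be saved next to each trace.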

## Baseline symptoms

| Symptom | Evidence | Possible cause | Next action |
| --- | --- | --- | --- |
| GPU timeline gaps | Nsight Systems | dataloader, CPU sync, launch overhead | inspect CPU trace and synchronization points |
| Long NCCL blocks | Nsight Systems | topology, bucket size, missing overlap | check rank mapping and communication groups |
| High memory throughput, low compute | Nsight Compute | HBM-bound kernel | try fusion, tiling, layout, FlashAttention, quantization |
| Shared memory conflicts | Nsight Compute | bank mapping | try padding or access remapping |
| P99 latency spikes | serving trace | queueing, KV fragmentation, workload mix | hold the workload distribution fixed and inspect the scheduler |
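
A symptom such as a P99 spike can only be judged against a measured baseline that respects the warmup and measurement steps recorded above. The following is a rough sketch using only the standard library; `measure_latency`, `summarize`, and the `sync_fn` hook are hypothetical names. On a GPU, `sync_fn` would typically be something like `torch.cuda.synchronize`, so that asynchronously launched kernels finish before the clock stops.

```python
import statistics
import time


def measure_latency(step_fn, warmup_steps=3, measure_steps=20, sync_fn=None):
    """Time step_fn per call in milliseconds, discarding warmup iterations."""
    for _ in range(warmup_steps):
        step_fn()
    if sync_fn is not None:
        sync_fn()  # drain pending work so warmup does not leak into timing

    samples_ms = []
    for _ in range(measure_steps):
        start = time.perf_counter()
        step_fn()
        if sync_fn is not None:
            sync_fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms


def summarize(samples_ms):
    """Return median and 99th-percentile latency (needs >= 2 samples)."""
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50_ms": statistics.median(samples_ms), "p99_ms": cuts[98]}
```

With only a few dozen measurement steps the P99 value is dominated by the single slowest sample, so it is worth recording the step count alongside the number.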

## One-change rule

Write the hypothesis before changing code:

- Hypothesis:
- One variable changed:
- Expected metric movement:
- Actual metric movement:
- Keep / rollback decision:
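
The same fields can be kept as a small record so that the keep/rollback decision is derived from the numbers rather than decided after the fact. This is one possible sketch: `ChangeRecord` is a hypothetical name, and the 2% minimum relative change is an arbitrary assumption that should be replaced by the run-to-run noise you actually observe.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    hypothesis: str
    variable_changed: str
    metric: str
    expected_direction: str  # "down" or "up"
    baseline: float          # metric value before the change
    actual: float            # metric value after the change

    def decision(self, min_rel_change: float = 0.02) -> str:
        """Keep only if the metric moved the expected way by enough.

        min_rel_change is an assumed noise threshold, not a standard value.
        """
        rel = (self.actual - self.baseline) / self.baseline
        moved = -rel if self.expected_direction == "down" else rel
        return "keep" if moved >= min_rel_change else "rollback"


# Illustrative, made-up numbers: step time expected to drop after one change.
record = ChangeRecord(
    hypothesis="Pinned host memory removes a copy stall",
    variable_changed="pin_memory: False -> True",
    metric="step_time_ms",
    expected_direction="down",
    baseline=100.0,
    actual=90.0,
)
print(record.decision())  # keep
```

Writing the expected direction into the record first makes it harder to rationalize an unexpected result afterwards.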