Lab 04: CUDA Reduce Optimization
Annotated code reading lab. Running code is optional.
GPU Kernels / Profiling
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Read code to understand the concept
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Core mechanism
- Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.
- Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Annotated starter links
These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.
Starter Preview
Excerpt from code/lab-04-cuda-reduce/reduce.cu. This preview explains the key idea; the linked starter file is the source of truth.
__global__ void atomic_reduce_kernel(const float* x, float* out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float local = 0.0f;
for (int i = idx; i < n; i += blockDim.x * gridDim.x) local += x[i];
atomicAdd(out, local); // many blocks contend on one global address
}
__global__ void shared_reduce_kernel(const float* x, float* partial, int n) {
extern __shared__ float smem[];
smem[threadIdx.x] = local_sum_for_this_thread;
__syncthreads();
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
__syncthreads();
}
if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];
}Key code blocks
blockIdx/threadIdx- Identify which slice of input each thread processes.
atomicAdd- Correct but potentially slow when many threads or blocks update one location.
extern __shared__- Allocates fast block-local scratch space for reduction.
__syncthreads- Prevents reading partial sums before other threads have written them.
partial[blockIdx.x]- Writes one result per block, reducing global contention.
How to read this code
- Shared memory improves locality but requires correct synchronization.
- The second kernel still needs a second-stage reduction for block partials.
- Reduction performance often depends on memory traffic and synchronization, not only arithmetic.
What this code does not mean
- “Atomic operations are always bad.” They are simple and sometimes acceptable, but high contention hurts.
- “Shared memory automatically makes code faster.” Access pattern and occupancy still matter.
How to say it out loud
The baseline uses global atomics, so many blocks contend on one memory location. A shared-memory tree reduce first combines values inside each block and writes one partial result per block, reducing global writes. The tradeoff is more synchronization and shared-memory management.
Additional intuition
- CUDA Best Practices is the authority for memory access behavior; use it to separate coalescing, shared memory and occupancy questions. Official: CUDA Best Practices Guide
- The NVIDIA shared-memory blog is a practical intuition source: shared memory helps when it turns repeated or awkward global access into local cooperative access. Blog: NVIDIA using shared memory in CUDA C/C++
- Nsight Compute docs reinforce that reduce optimization should be explained with kernel counters, not just code appearance. Official: Nsight Compute documentation
