Lab 04 - CUDA Reduce Optimization

Overview

Lab 04: CUDA Reduce Optimization

Annotated code reading lab. Running code is optional.

Related handbook section

GPU Kernels / Profiling

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

GPU Kernels Profiling

Concept Goal

Read code to understand the concept

Mental Model

Core mechanism

Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.
Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README reduce.cu build.sh

Annotated Code Preview

Starter Preview

Excerpt from code/lab-04-cuda-reduce/reduce.cu. This preview explains the key idea; the linked starter file is the source of truth.

Open starter file

__global__ void atomic_reduce_kernel(const float* x, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  float local = 0.0f;
  for (int i = idx; i < n; i += blockDim.x * gridDim.x) local += x[i];
  atomicAdd(out, local);  // many blocks contend on one global address
}

__global__ void shared_reduce_kernel(const float* x, float* partial, int n) {
  extern __shared__ float smem[];
  smem[threadIdx.x] = local_sum_for_this_thread;
  __syncthreads();

  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];
}

Line-by-line Explanation

Key code blocks

blockIdx/threadIdx: Identify which slice of input each thread processes.
atomicAdd: Correct but potentially slow when many threads or blocks update one location.
extern __shared__: Allocates fast block-local scratch space for reduction.
__syncthreads: Prevents reading partial sums before other threads have written them.
partial[blockIdx.x]: Writes one result per block, reducing global contention.

What to Notice

How to read this code

Shared memory improves locality but requires correct synchronization.
The second kernel still needs a second-stage reduction for block partials.
Reduction performance often depends on memory traffic and synchronization, not only arithmetic.

Common Misunderstandings

What this code does not mean

“Atomic operations are always bad.” They are simple and sometimes acceptable, but high contention hurts.
“Shared memory automatically makes code faster.” Access pattern and occupancy still matter.

Interview Explanation

How to say it out loud

The baseline uses global atomics, so many blocks contend on one memory location. A shared-memory tree reduce first combines values inside each block and writes one partial result per block, reducing global writes. The tradeoff is more synchronization and shared-memory management.

External intuition notes

Additional intuition

CUDA Best Practices is the authority for memory access behavior; use it to separate coalescing, shared memory and occupancy questions. Official: CUDA Best Practices Guide
The NVIDIA shared-memory blog is a practical intuition source: shared memory helps when it turns repeated or awkward global access into local cooperative access. Blog: NVIDIA using shared memory in CUDA C/C++
Nsight Compute docs reinforce that reduce optimization should be explained with kernel counters, not just code appearance. Official: Nsight Compute documentation

InfraLens