InfraLens

A clear starting point for learning AI infrastructure.

Overview

Lab 04: CUDA Reduce Optimization

Annotated code reading lab. Running code is optional.

Concept Goal

Read code to understand the concept

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

Mental Model

Core mechanism

  • Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.
  • Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

Annotated Code Preview

Starter Preview

Excerpt from code/lab-04-cuda-reduce/reduce.cu. This preview explains the key idea; the linked starter file is the source of truth.

__global__ void atomic_reduce_kernel(const float* x, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  float local = 0.0f;
  for (int i = idx; i < n; i += blockDim.x * gridDim.x) local += x[i];
  atomicAdd(out, local);  // many blocks contend on one global address
}

__global__ void shared_reduce_kernel(const float* x, float* partial, int n) {
  extern __shared__ float smem[];
  smem[threadIdx.x] = local_sum_for_this_thread;
  __syncthreads();

  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];
}
Line-by-line Explanation

Key code blocks

blockIdx/threadIdx
Identify which slice of input each thread processes.
atomicAdd
Correct but potentially slow when many threads or blocks update one location.
extern __shared__
Allocates fast block-local scratch space for reduction.
__syncthreads
Prevents reading partial sums before other threads have written them.
partial[blockIdx.x]
Writes one result per block, reducing global contention.
What to Notice

How to read this code

  • Shared memory improves locality but requires correct synchronization.
  • The second kernel still needs a second-stage reduction for block partials.
  • Reduction performance often depends on memory traffic and synchronization, not only arithmetic.
Common Misunderstandings

What this code does not mean

  • “Atomic operations are always bad.” They are simple and sometimes acceptable, but high contention hurts.
  • “Shared memory automatically makes code faster.” Access pattern and occupancy still matter.
Interview Explanation

How to say it out loud

The baseline uses global atomics, so many blocks contend on one memory location. A shared-memory tree reduce first combines values inside each block and writes one partial result per block, reducing global writes. The tradeoff is more synchronization and shared-memory management.

External intuition notes

Additional intuition

Further Reading

Official, paper and practical references