# Lab 04: CUDA Reduce Optimization

This CUDA starter is meant to be read: it contrasts two reduction strategies.
Compilation is optional; the goal is to understand how the kernels replace
contended global atomics with staged, block-local aggregation.

## Reading focus

- `atomic_reduce_kernel` lets many blocks update one global address.
- `shared_reduce_kernel` first accumulates per-thread sums in registers, then combines them in shared memory.
- `__syncthreads()` makes block-local shared-memory writes visible before each tree step.
- The second kernel launch reduces block partials into the final scalar.
- The warp-shuffle comment points to a common next optimization after the shared-memory tree.
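
The two strategies in the list above can be sketched roughly as follows. This is a minimal illustration, not the starter's code: the kernel names `atomic_reduce_sketch` and `shared_reduce_sketch`, their signatures, the block size `kBlock`, the 64-block cap, and the host helper `reduce_sum` are all assumptions made for the example. The starter's kernels may issue atomics at a different granularity or size their launches differently.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int kBlock = 256;  // assumed block size; the tree below needs a power of two

// Strategy 1 (sketch): every thread adds its element into one global address,
// so all atomics in the grid serialize on the same location.
__global__ void atomic_reduce_sketch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Strategy 2 (sketch): accumulate in a register, combine within the block in
// shared memory, and write one partial sum per block.
__global__ void shared_reduce_sketch(const float* in, float* partial, int n) {
    __shared__ float smem[kBlock];

    // Register accumulation over a grid-stride loop.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc += in[i];
    }
    smem[threadIdx.x] = acc;
    __syncthreads();  // all per-thread sums are visible before the tree starts

    // Tree reduction: halve the number of active threads at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();  // finish this step before the next one reads smem
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];
}

// Hypothetical host helper (error checking omitted for brevity).
float reduce_sum(const std::vector<float>& h, bool use_atomic) {
    const int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    const int full_grid = (n + kBlock - 1) / kBlock;
    const int blocks = std::min(full_grid, 64);  // assumed cap on partial sums

    float *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    if (use_atomic) {
        cudaMemset(d_out, 0, sizeof(float));  // atomics accumulate into zero
        atomic_reduce_sketch<<<full_grid, kBlock>>>(d_in, d_out, n);
    } else {
        // First launch: one partial sum per block.
        shared_reduce_sketch<<<blocks, kBlock>>>(d_in, d_partial, n);
        // Second launch: a single block reduces the partials to one scalar.
        shared_reduce_sketch<<<1, kBlock>>>(d_partial, d_out, blocks);
    }

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

Calling `reduce_sum(data, true)` and `reduce_sum(data, false)` on the same input exercises both paths. For general inputs the two results can differ in the last bits, because floating-point addition is not associative and the kernels add in different orders.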

## Optional commands

If you later have access to a CUDA development machine (the last command profiles the binary with Nsight Compute):

```bash
chmod +x build.sh
./build.sh
./reduce
ncu --set full ./reduce
```

## Questions to answer while reading

- Why is one global `atomicAdd` destination a bottleneck?
- What does shared memory save, and what synchronization cost does it add?
- Why are reductions often memory-bandwidth sensitive?