Lab 05 - Shared Memory Bank Conflict

Overview

Lab 05: Shared Memory Bank Conflict

Annotated code reading lab. Running code is optional.

Related handbook section

GPU Kernels

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.

GPU Kernels

Concept Goal

Read code to understand the concept

Mental Model

Core mechanism

Kernel performance depends on data movement as much as math. Use memory hierarchy, tiling, fusion, coalescing, bank conflicts, and profiler counters to explain whether the workload is bandwidth-bound or compute-bound.
Explain the problem, the mechanism, the resource tradeoff, the common failure mode, and the measurement that would validate the claim.

Starter files

Annotated starter links

These files are reading material first. If you later decide to run them, treat the run as optional validation rather than the main learning path.

README transpose.cu build.sh

Annotated Code Preview

Starter Preview

Excerpt from code/lab-05-bank-conflict/transpose.cu. This preview explains the key idea; the linked starter file is the source of truth.

Open starter file

constexpr int TILE = 32;

__global__ void transpose_tiled(const float* in, float* out, int n) {
  __shared__ float tile[TILE][TILE];      // 32 x 32
  tile[threadIdx.y][threadIdx.x] = in[y * n + x];
  __syncthreads();
  out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

__global__ void transpose_padded(const float* in, float* out, int n) {
  __shared__ float tile[TILE][TILE + 1];  // 32 x 33 changes bank mapping
  tile[threadIdx.y][threadIdx.x] = in[y * n + x];
  __syncthreads();
  out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

Line-by-line Explanation

Key code blocks

tile[32][32]: The natural square tile is easy to write but can produce bank conflicts during transposed access.
tile[32][33]: The extra column changes the address stride, spreading accesses across banks.
__syncthreads: Ensures the tile is fully loaded before threads read transposed positions.
global load/store: The tile is used to make global memory access more coalesced while managing shared-memory layout separately.

What to Notice

How to read this code

Padding helps a specific shared-memory layout problem; it is not a universal speed button.
Coalesced global memory and bank-conflict-free shared memory must both be considered.
The same idea appears in GEMM tiles and other shared-memory kernels.

Common Misunderstandings

What this code does not mean

“Shared memory is always faster than global memory.” It is faster only when access patterns avoid conflicts and occupancy remains healthy.
“The padding changes the math.” It changes memory layout, not the transpose result.

Interview Explanation

How to say it out loud

Transpose uses shared memory to decouple coalesced global reads from coalesced global writes. But reading a 32x32 tile by columns can map a warp to the same bank. Adding a padding column changes the stride and reduces bank conflicts.

External intuition notes

Additional intuition

CUDA Best Practices gives the fact base: bank conflicts split a shared-memory request into separate conflict-free requests, reducing effective bandwidth. Official: CUDA Best Practices shared memory
The NVIDIA shared-memory blog is useful for the first mental distinction: shared memory can fix non-coalesced global access, but it has its own bank layout constraints. Blog: NVIDIA using shared memory in CUDA C/C++
Nsight Compute is the practical follow-up because bank conflict claims should eventually be checked with shared-memory metrics. Official: Nsight Compute documentation

InfraLens