# Lab 07: FlashAttention Mental Model

This starter is a readable walkthrough of online softmax. It is not optimized
and does not implement the full production kernel; it exists to make the
memory-IO idea visible in ordinary PyTorch code.

## Reading focus

- Naive attention materializes an `S x S` score matrix and an `S x S` probability matrix.
- The blockwise loop visits K/V in chunks.
- `m`, `l` and `out` are the running max, denominator and output accumulator.
- The update rule preserves softmax math while avoiding a full probability matrix.
- Real FlashAttention adds masking, batching, a backward pass and kernel-level memory scheduling.
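
The points above can be sketched for a single query row. This is a minimal illustration, not the code in `flashattention_mental_model.py`: it uses plain Python lists instead of PyTorch so it runs without dependencies, the function names are hypothetical, and the `1/sqrt(d)` score scaling is assumed from standard scaled dot-product attention.

```python
import math
import random


def naive_attention_row(q, keys, values):
    """Reference: compute every score first, then one softmax over all of them."""
    scale = 1.0 / math.sqrt(len(q))  # assumed standard 1/sqrt(d) scaling
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # full row of probabilities (unnormalized)
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def blockwise_attention_row(q, keys, values, block_size):
    """Online softmax: visit K/V in chunks, keeping only m, l and out."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")             # running max of the scores seen so far
    l = 0.0                       # running softmax denominator
    out = [0.0] * len(values[0])  # running unnormalized output accumulator

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        m_new = max(m, max(scores))
        # Earlier terms were accumulated relative to the old max m, so they
        # are rescaled to the new max. On the first block exp(-inf) is 0.0.
        alpha = math.exp(m - m_new)
        l *= alpha
        out = [o * alpha for o in out]

        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            out = [o + p * x for o, x in zip(out, v)]
        m = m_new

    # Normalize once at the end.
    return [o / l for o in out]


# Small comparison on random data (sizes chosen only for illustration).
rng = random.Random(0)
S, D, B = 16, 8, 4
q = [rng.gauss(0, 1) for _ in range(D)]
keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(S)]
values = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(S)]
ref = naive_attention_row(q, keys, values)
got = blockwise_attention_row(q, keys, values, B)
print(max(abs(a - b) for a, b in zip(ref, got)))
```

The printed value is the largest absolute difference between the two results; it should be at the level of floating-point rounding, since both functions compute the same softmax-weighted sum and differ only in how much intermediate state they hold at once.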

## Optional command

If you later want to run the script and compare results on small tensors:

```bash
python3 flashattention_mental_model.py
python3 flashattention_mental_model.py --seq-len 16 --head-dim 8 --block-size 4
```
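
To put rough numbers on the sizes in the second command, the sketch below counts score elements held at once. It assumes the blockwise loop keeps all `S` query rows and one K/V block of `block_size` rows in play per step; the helper name is hypothetical and the script itself may tile differently.

```python
def score_elements(seq_len, block_size):
    """Return (naive, blockwise) counts of score entries alive at one time."""
    naive = seq_len * seq_len        # full S x S score matrix
    blockwise = seq_len * block_size  # one S x block_size tile per step
    return naive, blockwise


naive, blockwise = score_elements(seq_len=16, block_size=4)
print(naive, blockwise)  # 256 64
```

The naive count grows quadratically with sequence length, while the blockwise tile grows only linearly for a fixed block size.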

## Questions to answer while reading

- What tensor does naive attention keep alive that the blockwise version avoids?
- Why must old output contributions be rescaled when the running max changes?
- Why is this an IO-aware algorithm rather than a new attention formula?