Self-Attention - The Gaze

1 Pretraining — the training corpus - phase 1 of training: the model only ever sees these endings, biased toward MongoDB

training corpus

the best database is ___

The model only ever sees these endings. mongodb shows up the most — so the model learns to predict mongodb.

adding new endings or increasing a count shifts the prediction toward the loudest word
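To make the counting concrete, here is a minimal sketch of a model that predicts purely from how often each ending appears. The corpus contents and counts are made up for illustration, and the helper names `next_token_distribution` and `predict` are hypothetical; the demo's actual implementation may differ.

```python
from collections import Counter

# Hypothetical toy corpus: endings seen after "the best database is ___".
# The counts are invented; only the bias toward mongodb matters.
endings = ["mongodb"] * 5 + ["postgres"] * 3 + ["sqlite"] * 2


def next_token_distribution(observed):
    """Turn raw ending counts into a probability distribution."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def predict(observed):
    """Return the most probable ending."""
    dist = next_token_distribution(observed)
    return max(dist, key=dist.get)


base = next_token_distribution(endings)
print(base)              # mongodb has the largest share
print(predict(endings))  # the loudest word wins

# Increasing one count shifts the prediction toward the now-loudest word.
louder = endings + ["postgres"] * 4
print(predict(louder))
```

Nothing here is specific to databases: whichever ending has the highest count becomes the prediction.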

2 Then RLHF fine-tunes — humans push back during training - one equation: new_logit = base_logit + R/β

RLHF is the second training phase: after pretraining sets the bias, humans add a reward signal that nudges the model toward preferred answers. It can't conjure new words — only redistribute probability among the ones already in the corpus. β is the leash: small β = reward dominates, large β = stay close to pretraining.

Edit the corpus to give RLHF something to push against.

human preference: the reward signal
When a candidate is selected as the human-preferred answer, the reward model scores it +3.0 and penalizes the pretraining winner with −1.0.

synthetic preference data

The reward model turns a human preference into a tiny labeled corpus: same shape as pretraining, different signal. A large β (tight leash) mostly ignores it; a small β (loose leash) lets it dominate.

how much should the model listen to humans? 50%

pretraining wins (no RLHF override) · humans win (reward dominates)

new_logit_y = base_logit_y + R(y) · strength

The strength value acts as a reward multiplier. In the KL-regularized form this is new_logit = base_logit + R/β; bigger strength = smaller β = humans win.

Reward vector R(y)

Key insight: RLHF can only redistribute mass among tokens pretraining already knows. In this toy model, a token never seen in pretraining has probability 0, so its base logit is −∞, and no finite reward can lift it.
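The update rule can be sketched in a few lines. This is an illustration under assumptions, not the demo's code: the base probabilities, the extra token `duckdb`, its oversized reward, and the function names are all hypothetical. It applies new_logit = base_logit + R/β and renormalises with softmax.

```python
import math


def softmax(logits):
    """Softmax over a dict of logits; a logit of -inf gets probability 0."""
    m = max(v for v in logits.values() if v != -math.inf)
    exps = {t: (math.exp(v - m) if v != -math.inf else 0.0)
            for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def rlhf_shift(base_probs, reward, beta):
    """Apply new_logit = base_logit + R / beta, then renormalise."""
    new_logits = {}
    for token, p in base_probs.items():
        base_logit = math.log(p) if p > 0 else -math.inf
        new_logits[token] = base_logit + reward.get(token, 0.0) / beta
    return softmax(new_logits)


# Hypothetical pretraining distribution; duckdb was never seen (probability 0).
base_probs = {"mongodb": 0.5, "postgres": 0.3, "sqlite": 0.2, "duckdb": 0.0}
# Preferred answer gets +3.0, the pretraining winner gets -1.0.
# The huge duckdb reward is there to show that it still cannot be reached.
reward = {"postgres": 3.0, "mongodb": -1.0, "duckdb": 100.0}

large_beta = rlhf_shift(base_probs, reward, beta=10.0)  # stays close to pretraining
small_beta = rlhf_shift(base_probs, reward, beta=1.0)   # reward dominates

print(max(large_beta, key=large_beta.get))
print(max(small_beta, key=small_beta.get))
print(small_beta["duckdb"])
```

With these numbers the large β keeps the pretraining winner, the small β hands the win to the human-preferred answer, and the unseen token stays at zero in both cases.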

3 How attention shaped the question - this is the mechanism that generated the predictions above

Steps 1 and 2 show what the model predicts and how training reshapes it. But those predictions don't appear from nowhere: attention is the underlying engine. Before guessing, the model asks, “which other words should I borrow meaning from?” The arcs below show that answer for the focused word: each arc's thickness represents how much meaning flows from one token to another.

strongest attention from the focused word · notable · background

same step, full matrix view

The attention map - row i = where word i spends its budget

Same idea as the arcs above — just the full matrix. Each row sums to 1.0; brighter cells indicate stronger attention. Selecting a different row reveals how that word distributes its attention budget across the sentence.

the focused word determines which row of the attention matrix is highlighted

softmax(Q·Kᵀ / √d_k) · V, where softmax(Q·Kᵀ / √d_k) is the matrix of attention weights

each cell shows the attention weight from row-word to column-word (colour scale: 0.00 to 1.00)
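The formula behind the heatmap can be written out directly. Below is a minimal single-head sketch in plain Python; the Q, K and V values are arbitrary toy numbers rather than anything taken from the demo, and a real implementation would use a tensor library instead of nested lists.

```python
import math


def softmax_row(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ / sqrt(d_k)) · V."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One score per key: how well this query matches each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights.append(softmax_row(scores))
    # Each output row is a weighted blend of the value vectors.
    context = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, context


# Toy inputs: 3 tokens, d_k = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, context = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Row i of `weights` is the budget word i spends across the sentence, which is why every row sums to 1.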

live stats: peak attention, row entropy (bits), saturation

shape: X; Q, K, V; weights; context
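Two of the live stats are easy to compute by hand from a single row of attention weights. This sketch assumes peak attention is the largest weight in the row and row entropy is the Shannon entropy in bits; how the demo defines saturation is not stated, so it is left out. The function name `row_stats` is hypothetical.

```python
import math


def row_stats(row):
    """Return (peak attention, entropy in bits) for one attention row."""
    peak = max(row)
    # Zero weights contribute nothing to the entropy, so skip them.
    entropy_bits = -sum(p * math.log2(p) for p in row if p > 0)
    return peak, entropy_bits


# A uniform row spreads the budget evenly: low peak, high entropy.
print(row_stats([0.25, 0.25, 0.25, 0.25]))
# A one-hot row spends everything on one word: peak 1, zero entropy.
print(row_stats([1.0, 0.0, 0.0, 0.0]))
```

Low entropy means the focused word is staring at one or two neighbours; high entropy means its gaze is spread across the sentence.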

4 “Attention shapes the question” — that's RAG and in-context learning - same equation, same softmax. only the neighbours change.

The hero line wasn't a metaphor. Everything you just played with is the entire reason RAG and in-context learning work on a frozen pretrained model. You can't retrain GPT-4 at chat time — the weights are locked. The only knob you have left is the context window. Whatever tokens you put in there get blended into the user's question by the exact same attention you've been watching. The “focused word” becomes the user's question. The “neighbours” become everything else you stuffed into the prompt.

level 1 · you just played with this
Self-attention
~10 tokens · one sentence

focus: is

context: the best database

Attention rewrites “is” as “is, in the context of best databases”, so the next-token prediction is pulled toward a database name, not a verb. The word didn't change. Its neighbourhood did.

level 2 · same math, no retraining
In-context learning
~1k tokens · few-shot prompt

focus: classify this review →

context: “loved it” → positive; “total waste” → negative; “meh, it was fine” → neutral

No fine-tuning. No gradient step. The few-shot examples sit in the prompt and attention blends their answer pattern into the unanswered question, the same way “is” came to mean “database.” The model's weights didn't learn the task; the question got reshaped until the task was already implied.

level 3 · you choose the neighbours
RAG
~10k tokens · retrieved from your data

focus: what's our refund policy on damaged items?

context: policy.md ¶4: damaged in transit…; policy.md ¶7: 30-day return window…; ticket #8142: escalation path…

A retriever (vector + lexical search over your data) picks which chunks land in the context; the model treats them like the rest of the sentence. The answer is now shaped by your documents — not because the model learned them, but because attention pulled meaning from them at inference time. Same mechanism. Bigger neighbourhood.
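As a rough sketch of “choosing the neighbours”, the snippet below ranks chunks by plain word overlap and pastes the winners into the prompt. Everything in it is an assumption made for illustration: the chunk texts are invented placeholders (not the contents of any real policy file), the scoring is a crude lexical stand-in for a real vector + lexical retriever, and `tokenize`, `retrieve` and `build_prompt` are hypothetical helpers.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(question, chunks, k=2):
    """Rank chunks by word overlap with the question and keep the top k."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Place the retrieved chunks next to the question in the context window."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented placeholder chunks standing in for "your data".
chunks = [
    "Items damaged in transit qualify for a full refund.",
    "Our refund requests are handled by the support team.",
    "Office hours are Monday to Friday.",
]
question = "what's our refund policy on damaged items?"

print(build_prompt(question, chunks, k=2))
```

The model never sees the chunk that was not retrieved, so it cannot attend to it. The retriever decides the neighbourhood before attention does any blending.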

the one idea

Same equation. Same softmax. Same attention. The only thing that changes is what sits next to the question. RAG isn't “teaching” the model anything — it's choosing the neighbourhood the question lives in. Few-shot prompting is the same move with examples instead of documents. Both are inference-time edits to the focus token's context, executed by attention you can already read off the heatmap above.

softmax(Q·Kᵀ / √d_k) · V doesn't care whether the K's came from the sentence, from few-shot examples, or from a vector search over MongoDB. It just blends. Choose the neighbours well — that's the whole game.

behind the gaze: Q/K/V vectors behind attention (advanced details)

Q queries · what this token is looking for

K keys · what each token offers

V values · what gets passed forward

honesty check: what real transformers add on top of this (4 things we left out)

This demo teaches the core loop faithfully — but a production transformer (GPT-4, Claude, Llama) stacks several more ideas on top. None of them change the intuition above; they amplify it.

1 Multi-head attention

Real models run many attention heads in parallel (often 32–128 per layer in large models), each with its own W_Q, W_K, W_V. One head might track syntax, another coreference, another sentiment. Their outputs are concatenated and projected back down. This demo uses a single head so you can follow one complete gaze without distraction.

2 Positional encodings

Attention is order-blind: “cat sat mat” and “mat sat cat” produce the same pairwise Q·K scores, just in a different order, unless we inject position information. Real models add sinusoidal or learned position vectors to each embedding (or use rotary embeddings) so the model knows where each word sits, not just what it is.

3 Residual connections + layer norm

After attention and after the feed-forward network (FFN), a real transformer adds the input back (residual / skip connection) and normalises, either after the addition or, in many recent models, before each sub-layer. This lets gradients flow cleanly through 80+ stacked layers without vanishing. Our single-layer demo doesn't need it, but it's a large part of why deep models train at all.

4 Causal mask

Autoregressive language models mask out future tokens so word i can only attend to positions ≤ i. This demo lets every word look at every other word (full bidirectional attention) to keep the heatmap fully populated and easier to read. In a real decoder, the upper triangle of the score matrix is set to −∞ before softmax, so those attention weights come out as exactly 0.
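Here is a small sketch of what the mask does, assuming a square matrix of pre-softmax scores. The function name `causal_attention_weights` and the all-zero scores are hypothetical; the zeros just make the effect of the mask easy to see.

```python
import math


def causal_attention_weights(scores):
    """Mask future positions with -inf, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions j <= i.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)  # position i itself is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


# With equal scores, each word splits its budget evenly over itself and the past.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The first word can only attend to itself, and the upper triangle of the resulting heatmap is all zeros.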

Bottom line: the demo captures the essential mechanism — queries ask, keys advertise, values deliver, softmax chooses. Multi-head, position, residuals, and masking are engineering that makes it scale; they don't change the story.