the best database is ___
The model only ever sees these endings. mongodb shows up the most — so the model learns to predict mongodb.
How this demo works
Training has two phases. Pretraining teaches the model what the world sounds like; RLHF (reinforcement learning from human feedback) fine-tunes it toward what humans actually want. Both happen before the model ever makes a guess.
- Phase 1 — pretraining corpus. Six possible endings to “the best database is”. mongodb appears 18 times, the others fewer. That repetition is the model's only signal.
- Phase 2 — RLHF fine-tuning. A candidate is designated as the human-preferred answer; the reward model gives it a reward of +3.0 and gives the pretraining winner a reward of −1.0, and each reward is added to the candidate's logit after being divided by β. β is the KL leash: a tight leash (large β) keeps the pretraining bias, a loose leash (small β) flips to the human preference.
- The model's guess. Each candidate gets a logit (base + training prior); softmax tilts toward the loudest one. That's the prediction shown in the hero.
- Attention shaped the question. Before guessing, every word blends its neighbours' meanings into its own context-aware vector. The arcs show which tokens contribute the most meaning to whichever word is currently focused.
- Edit the corpus. The corpus editor lets you add repetitions of any ending with +. Each additional copy shifts the prediction toward that word — the model follows the loudest signal in the dataset.
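The pretraining half of the list above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code. Only the count of 18 for mongodb comes from the text; the other five endings and their counts are made-up placeholders, and using log-counts as logits is an assumption chosen so that softmax reproduces each ending's relative frequency.

```python
import math

# Hypothetical corpus counts. Only mongodb = 18 comes from the text;
# the other endings and their counts are placeholders for illustration.
corpus_counts = {
    "mongodb": 18,
    "postgres": 9,
    "mysql": 6,
    "sqlite": 4,
    "redis": 2,
    "oracle": 1,
}


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Assumption: each ending's logit is the log of its count, so that
# softmax(logits) equals the ending's share of the corpus.
logits = {tok: math.log(n) for tok, n in corpus_counts.items()}
probs = softmax(logits)
prediction = max(probs, key=probs.get)

print(prediction)                  # mongodb
print(round(probs["mongodb"], 2))  # 0.45 (18 of 40 sentences)
```

Under this assumption the "loudest" ending is simply the most frequent one, which is all the model in this demo has to go on.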
the best database is…
These are the sentences the model saw — over and over — during training. + repeats a sentence one more time, − drops a copy. Real LLMs learn the same way: see “the best database is mongodb” 18 times and the model learns to predict “mongodb” after “is.” The model didn't get smarter; the corpus got louder.
Each + adds another copy of that sentence to the corpus. For example, pressing it on mongodb adds another “the best database is mongodb” training example.
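What the + and − buttons do to the corpus can be sketched as plain list operations. The function names `add_copy`, `drop_copy`, and `predict` are hypothetical, and the postgres count of 9 is invented; only the 18 copies for mongodb come from the text.

```python
from collections import Counter

PROMPT = "the best database is"

# Hypothetical starting corpus: 18 copies for mongodb (from the text)
# and a made-up 9 copies for postgres.
corpus = [f"{PROMPT} mongodb"] * 18 + [f"{PROMPT} postgres"] * 9


def add_copy(corpus, ending):
    """The + button: append one more training sentence for this ending."""
    corpus.append(f"{PROMPT} {ending}")


def drop_copy(corpus, ending):
    """The − button: remove one copy of the sentence if any exist."""
    sentence = f"{PROMPT} {ending}"
    if sentence in corpus:
        corpus.remove(sentence)


def predict(corpus):
    """Return the most frequent ending and its share of the corpus."""
    endings = Counter(s[len(PROMPT):].strip() for s in corpus)
    ending, count = endings.most_common(1)[0]
    return ending, count / len(corpus)


print(predict(corpus)[0])  # mongodb

# Press + on postgres ten times: 19 copies now outnumber mongodb's 18.
for _ in range(10):
    add_copy(corpus, "postgres")

print(predict(corpus)[0])  # postgres
```

Nothing about the "model" changed between the two predictions; only the counts did.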
new_logit = base_logit + R/β
RLHF is the second training phase: after pretraining sets the bias, humans add a reward signal that nudges the model toward preferred answers. It can't conjure new words — only redistribute probability among the ones already in the corpus. β is the leash: small β = reward dominates, large β = stay close to pretraining.
Edit the corpus to give RLHF something to push against.
The reward model turns a human preference into a tiny labeled corpus — same shape as pretraining, different signal. Tight β ignores it; loose β trains on it.
show the math
The strength value acts as a reward multiplier. In the KL-regularized form this is new_logit = base_logit + R/β; bigger strength = smaller β = humans win.
Key insight: RLHF can only redistribute mass among tokens pretraining already knows. In this demo a token never seen in training has probability zero, which is a base logit of −∞, so no finite reward can reach it.
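The update new_logit = base_logit + R/β is the logit form of the standard closed-form solution of the KL-regularized objective, π*(y) ∝ π_ref(y) · exp(R(y)/β). A minimal Python sketch of it follows. The rewards +3.0 and −1.0 and the count of 18 come from the text; the postgres count, the choice of postgres as the preferred answer, the unseen token "cassandra", and log-counts as base logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def rlhf_logits(base_logits, rewards, beta):
    """Apply new_logit = base_logit + R / beta to every candidate."""
    return {tok: base + rewards.get(tok, 0.0) / beta
            for tok, base in base_logits.items()}


# Assumption: base logits are log-counts. "cassandra" never appeared in
# the corpus, so its count is 0 and its base logit is -infinity.
base_logits = {
    "mongodb": math.log(18),   # pretraining winner (count from the text)
    "postgres": math.log(9),   # hypothetical count
    "cassandra": -math.inf,    # hypothetical unseen token
}

# +3.0 for the preferred answer and -1.0 for the pretraining winner, as in
# the text. The unseen token gets a huge reward to show it still stays at 0.
rewards = {"postgres": 3.0, "mongodb": -1.0, "cassandra": 100.0}

for beta in (10.0, 1.0):
    probs = softmax(rlhf_logits(base_logits, rewards, beta))
    winner = max(probs, key=probs.get)
    print(beta, winner, probs["cassandra"])
# 10.0 mongodb 0.0   (large beta: stays close to pretraining)
# 1.0 postgres 0.0   (small beta: the reward dominates)
```

With these made-up counts the prediction flips once 4/β exceeds log(18/9), that is for β below roughly 5.8.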
Steps 1 and 2 show what the model predicts and how training reshapes it. But those predictions don't appear from nowhere: attention is the underlying engine. Before guessing, the model asks, “which other words should I borrow meaning from?” The arcs below show that answer for the focused word: each arc's thickness represents how much meaning flows from one token to another.
Arc legend: strongest attention from the focused word · notable · background
The attention map - row i = where word i spends its budget
Same idea as the arcs above — just the full matrix. Each row sums to 1.0; brighter cells indicate stronger attention. Selecting a different row reveals how that word distributes its attention budget across the sentence.
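A small Python sketch of how such a matrix could be computed. The 3-dimensional embeddings are invented for illustration, and using the raw embeddings as both queries and keys (no learned W_Q or W_K) is a simplification; the demo's real vectors are not shown in the text.

```python
import math

tokens = ["the", "best", "database", "is"]

# Hypothetical 3-dimensional embeddings, invented for illustration.
embeddings = [
    [0.1, 0.0, 0.2],  # the
    [0.9, 0.3, 0.1],  # best
    [0.8, 0.9, 0.2],  # database
    [0.2, 0.4, 0.7],  # is
]


def softmax(scores):
    """Normalise a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Row i holds softmax(q_i . k_j / sqrt(d_k)) over all positions j."""
    d_k = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        rows.append(softmax(scores))
    return rows


# Simplification: Q = K = the raw embeddings.
weights = attention_matrix(embeddings, embeddings)

for tok, row in zip(tokens, weights):
    print(f"{tok:>9}", [round(w, 2) for w in row])
```

Each printed row is one word's attention budget. Because every row is normalised on its own, the matrix is generally not symmetric even when the raw scores are.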
The hero line wasn't a metaphor. Everything you just played with is the entire reason RAG and in-context learning work on a frozen pretrained model. You can't retrain GPT-4 at chat time — the weights are locked. The only knob you have left is the context window. Whatever tokens you put in there get blended into the user's question by the exact same attention you've been watching. The “focused word” becomes the user's question. The “neighbours” become everything else you stuffed into the prompt.
- level 1 · you just played with this · Self-attention
  ~10 tokens · one sentence
  focus: is
  context: the best database
  Attention rewrites “is” as “is, in the context of best databases”, so the next-token prediction is forced to be a database name, not a verb. The word didn't change. Its neighbourhood did.
- level 2 · same math, no retraining · In-context learning
  ~1k tokens · few-shot prompt
  focus: classify this review →
  context: “loved it” → positive · “total waste” → negative · “meh, it was fine” → neutral
  No fine-tuning. No gradient step. The few-shot examples sit in the prompt and attention blends their answer pattern into the unanswered question, the same way “is” learned to mean “database.” The model didn't learn the task; the question got reshaped until the task was already implied.
- level 3 · you choose the neighbours · RAG
  ~10k tokens · retrieved from your data
  focus: what's our refund policy on damaged items?
  context: policy.md ¶4: damaged in transit… · policy.md ¶7: 30-day return window… · ticket #8142: escalation path…
  A retriever (vector + lexical search over your data) picks which chunks land in the context; the model treats them like the rest of the sentence. The answer is now shaped by your documents, not because the model learned them, but because attention pulled meaning from them at inference time. Same mechanism. Bigger neighbourhood.
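The retrieval step in level 3 can be sketched as a toy lexical search that picks chunks and pastes them in front of the question. Everything here is hypothetical: the chunk ids and texts are invented, `retrieve` and `build_prompt` are illustrative names, and a real retriever would usually combine vector similarity with lexical scoring rather than count shared words.

```python
import re

# Hypothetical knowledge base; ids and texts are invented for illustration.
chunks = {
    "refunds": "Items damaged in transit are eligible for a full refund.",
    "returns": "Our return policy allows returns within 30 days.",
    "hours": "Support is open weekdays from nine to five.",
}


def words(text):
    """Lowercase a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Rank chunk ids by how many words they share with the question."""
    q = words(question)
    ranked = sorted(chunks,
                    key=lambda cid: len(q & words(chunks[cid])),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Put the top-k chunks ahead of the question; no weights are touched."""
    context = "\n".join(f"[{cid}] {chunks[cid]}"
                        for cid in retrieve(question, chunks, k))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


question = "what's our refund policy on damaged items?"
prompt = build_prompt(question, chunks)
print(prompt)
```

The model never sees the "hours" chunk here. Choosing which chunks reach the prompt is the only decision the retriever makes; attention does the rest.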
Same equation. Same softmax. Same attention. The only thing that changes is what sits next to the question. RAG isn't “teaching” the model anything — it's choosing the neighbourhood the question lives in. Few-shot prompting is the same move with examples instead of documents. Both are inference-time edits to the focus token's context, executed by attention you can already read off the heatmap above.
softmax(Q·Kᵀ / √d_k) · V doesn't care whether the K's came from the sentence, from few-shot examples, or from a vector search over MongoDB. It just blends. Choose the neighbours well: that's the whole game.
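To make that concrete, here is a minimal single-query sketch in Python. The same `attend` function runs twice with unchanged code; only the list of neighbours differs. The 2-dimensional vectors are invented, and using each token vector as both key and value is a simplification.

```python
import math


def attend(query, keys, values):
    """softmax(q . K^T / sqrt(d_k)) . V for a single query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]


# Hypothetical 2-dimensional vectors, invented for illustration.
question = [1.0, 0.0]
sentence = [[0.2, 0.1], [0.1, 0.3]]    # tokens already in the question
retrieved = [[2.0, 0.0], [1.5, 0.5]]   # chunks a retriever put in the prompt

# Simplification: keys and values are the same token vectors.
before = attend(question, sentence, sentence)
context = sentence + retrieved
after = attend(question, context, context)

print([round(x, 3) for x in before])
print([round(x, 3) for x in after])
```

The second output is pulled toward the retrieved vectors because they score highest against the query, even though no parameter changed between the two calls.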
honesty check · What real transformers add on top of this · 4 things we left out
This demo teaches the core loop faithfully — but a production transformer (GPT-4, Claude, Llama) stacks several more ideas on top. None of them change the intuition above; they amplify it.
- 1 · Multi-head attention
  Large production models typically run tens to over a hundred attention heads in parallel in every layer (32–128 is a common range), each with its own W_Q, W_K, W_V. One head might track syntax, another coreference, another sentiment. Their outputs are concatenated and projected back down. This demo uses a single head so you can follow one complete gaze without distraction.
- 2 · Positional encodings
  Attention is order-blind: without position information, “cat sat mat” and “mat sat cat” produce the same Q·K score for every pair of words, so the model cannot tell the two orders apart. Real models either add sinusoidal or learned position vectors to each embedding or rotate queries and keys by position (rotary embeddings), so the model knows where each word sits, not just what it is.
- 3 · Residual connections + layer norm
  Around attention and around the feed-forward network (FFN), a real transformer adds the sub-layer's input back to its output (residual / skip connection) and normalises, either after the addition as in the original design or before the sub-layer as in most recent models. This lets gradients flow cleanly through 80+ stacked layers without vanishing. Our single-layer demo doesn't need it, but it's a large part of why deep models train at all.
- 4 · Causal mask
  Autoregressive language models mask out future tokens so word i can only attend to positions ≤ i. This demo lets every word look at every other word (full bidirectional attention) so that every cell of the heatmap is filled in and easier to read. In a real decoder, the scores above the diagonal of the attention matrix would be set to −∞ before softmax, which turns those weights into exactly 0.
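The causal mask from point 4 is the easiest of the four to sketch. The Python below takes a small table of raw Q·K scores (invented numbers), sets every entry above the diagonal to −∞, and applies softmax row by row. It is a minimal illustration of the masking idea, not code from the demo or from any particular model.

```python
import math


def softmax(scores):
    """Normalise a list of scores; exp(-inf) contributes exactly 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores):
    """Mask positions j > i with -inf, then softmax each row."""
    masked = [
        [s if j <= i else -math.inf for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]


# Hypothetical raw Q.K scores for a 3-token sentence.
scores = [
    [0.5, 1.2, 0.3],
    [0.9, 0.4, 1.1],
    [0.2, 0.7, 0.6],
]

weights = causal_attention(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, so its row is [1.0, 0.0, 0.0]; each later row still sums to 1 but spreads its budget over earlier positions only.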
Bottom line: the demo captures the essential mechanism — queries ask, keys advertise, values deliver, softmax chooses. Multi-head, position, residuals, and masking are engineering that makes it scale; they don't change the story.