A live policy-gradient demonstration

Your feed isn't evil. It's a mirror of you.

The algorithm maximizes one thing: your reaction. Everything else — polarization, rabbit holes, doomscroll — is downstream. Change what you click on, change the feed.

“There are only two industries that call their customers ‘users’: illegal drugs and software.”

— Edward Tufte, quoted in The Social Dilemma

The film blamed the dealer. This demo shows the math: there is no dealer — only a gradient and your thumb.

The lens stack — how it actually works

Your attention selects what gets weight. The algorithm's question (“what will this user react to?”) constrains the manifold. The gradient sculpts the answer. The feed isn't generated — it's selected by successive narrowing: attention → context → training.

Same stack as an LLM. Pretraining = the prior. RLHF = sculpting. ICL (your prompt) rewrites attention without touching weights. RAG curates the only context the model sees. Whoever chooses the context chooses the output.

Run the same gradient on a billion reaction profiles and the mirror never stops being faithful — that's the problem. A billion perfect reflections with nothing in common.

Feed Strategy Reaction

Your engagement profile

Drag before or during — the algorithm follows your signal in real time.

λ = 0.00
λ=0 Reactive: outrage gets the biggest clicks.
λ=1 Intentional: depth gets the biggest clicks. Same algorithm.
Live commentary

Press Start the experiment. The algorithm has no frame yet — a flat prior, a coin flip. Your first reactions will be the first lens.

Act I

No frame

The algorithm has no attention yet. No question formed. The prior is flat — a coin flip between depth and outrage.

Act II

The question forms

Your reactions give the algorithm a frame. It begins to ask a shaped question: “which content makes this user react?”

Act III

Reinforcement sculpts

The gradient carves channels. Pretraining was the landscape; your clicks are the sculptor. The answer is being selected, not generated.

Act IV

The narrowing

Attention narrowed the context. Your clicks narrowed the trajectory. Training narrowed the distribution. What's left is what survived all three filters.

Act V

Locked answer

The output isn't generated — it's what survived. Run this selection in parallel for a billion people and the public square fragments into a billion non-overlapping answers.

How its strategy is shifting $\pi_\theta(a \mid s)$ i

Informative Outrage
step
0
% calm news
50.00%
% outrage
50.00%

What it just served you

user: calm
your reaction i
  1. The feed fills in here once the experiment starts.

Why it's happening

plain English first, math second
What the algorithm "believes" i
$\theta$
[0.0000, 0.0000]
Its current recommendation mix i
$\pi_\theta$
[0.5000, 0.5000]
What it just showed you i
$a_t$
How hard you reacted i
$G_t$
Update to its beliefs i
$\Delta\theta$
[0.0000, 0.0000]