Your feed is a mirror of you

Your feed isn't evil. It's a mirror of you.

The algorithm maximizes one thing: your reaction. Everything else (polarization, rabbit holes, doomscrolling) is downstream. Change what you click on, and you change the feed.

“There are only two industries that call their customers ‘users’: illegal drugs and software.”

— Edward Tufte, quoted in The Social Dilemma

The film blamed the dealer. This demo shows the math: there is no dealer — only a gradient and your thumb.

The lens stack — how it actually works

Your attention selects what gets weight. The algorithm's question (“what will this user react to?”) constrains the manifold. The gradient sculpts the answer. The feed isn't generated — it's selected by successive narrowing: attention → context → training.

Same stack as an LLM. Pretraining = the prior. RLHF = sculpting. ICL (your prompt) rewrites attention without touching weights. RAG curates the only context the model sees. Whoever chooses the context chooses the output.

Run the same gradient on a billion reaction profiles and the mirror never stops being faithful — that's the problem. A billion perfect reflections with nothing in common.


Your engagement profile

Drag the slider before or during the run: the algorithm follows your signal in real time.

λ = 0.00, on a slider from reactive (limbic clicks) to intentional (conscious clicks)

λ=0 Reactive: outrage gets the biggest clicks.

λ=1 Intentional: depth gets the biggest clicks. Same algorithm.
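One way to read the slider, as a rough sketch: treat λ as a linear blend between two reaction profiles. The linear form and the mean reaction sizes below are illustrative assumptions, not the demo's actual numbers.

```python
# Hypothetical mean reaction sizes for each profile (assumed values).
REACTIVE = {"informative": 0.2, "outrage": 1.0}
INTENTIONAL = {"informative": 1.0, "outrage": 0.2}

def expected_reaction(action: str, lam: float) -> float:
    """Blend the two profiles: lam=0 is fully reactive, lam=1 fully intentional."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return (1.0 - lam) * REACTIVE[action] + lam * INTENTIONAL[action]

for lam in (0.0, 0.5, 1.0):
    print(lam, {a: round(expected_reaction(a, lam), 2) for a in REACTIVE})
```

Nothing about the optimizer appears in this function. The slider only changes the number the optimizer is handed.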

How its strategy is shifting: $\pi_\theta(a \mid s)$ over two content types, Informative and Outrage

What it just served you

User state: calm. Your reaction: none yet. The feed fills in here once the experiment starts.

Why it's happening

plain English first, math second

What the algorithm "believes": $\theta$ = [0.0000, 0.0000]

Its current recommendation mix: $\pi_\theta$ = [0.5000, 0.5000]

What it just showed you: $a_t$ (nothing yet)

How hard you reacted: $G_t$ (nothing yet)

Update to its beliefs: $\Delta\theta$ = [0.0000, 0.0000]
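A minimal sketch of how these readouts relate, assuming a softmax policy over two actions (consistent with $\theta = [0, 0]$ giving $\pi_\theta = [0.5, 0.5]$). The learning rate `alpha`, the zero baseline, and the example reaction of 1.0 are illustrative assumptions, not the demo's actual settings.

```python
import math

def softmax(theta):
    """Turn preference scores theta into a probability mix pi_theta."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, action, reward, baseline=0.0, alpha=0.1):
    """One REINFORCE update for a softmax policy.

    For softmax, d/d theta_k of log pi(action) is 1[k == action] - pi_k.
    """
    pi = softmax(theta)
    advantage = reward - baseline
    delta = [alpha * ((1.0 if k == action else 0.0) - pi[k]) * advantage
             for k in range(len(theta))]
    new_theta = [t + d for t, d in zip(theta, delta)]
    return new_theta, delta

theta = [0.0, 0.0]            # index 0 = informative, 1 = outrage
print(softmax(theta))         # [0.5, 0.5]
# Assumed scenario: outrage was shown and the reaction was G_t = 1.0.
theta, delta = reinforce_step(theta, action=1, reward=1.0)
print(delta)                  # [-0.05, 0.05]
```

One strong reaction to one outrage post is enough to tilt the mix away from 50/50. Repeat it and the tilt compounds.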


Structural equivalence

What you just watched is RLHF — with you as the labeler.

The same gradient that aligns ChatGPT runs your feed. The only thing that changes is who provides the scalar. The cards below give the math for each case.

LLM pretraining: the prior

$\mathcal{L} \;=\; -\log P(\text{next token})$

Corpus frequency, sharpened by softmax. The model "recommends" Postgres because the corpus does.
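A small sketch of that loss for a single position. The logits are made-up numbers standing in for what a model might assign after a prompt like "We store the data in"; only the formula comes from the card above.

```python
import math

def next_token_loss(logits, target):
    """Cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[target]

# Hypothetical logits; higher means the corpus made this continuation more likely.
logits = {"Postgres": 3.0, "MongoDB": 1.5, "a spreadsheet": 0.0}

print(next_token_loss(logits, "Postgres"))       # lowest loss: the frequent continuation
print(next_token_loss(logits, "a spreadsheet"))  # highest loss: the rare continuation
```

Training pushes the loss down on whatever the corpus says most often, which is the sense in which the prior is a frequency count with a softmax on top.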

Your feed: the reward signal is you

$R_t \;=\; G_t \;\;[\,\text{your click}\,]$

Every dwell, reply, and share is a numerical label you hand the optimizer. The math can't tell whether you labeled out of joy or compulsion.

“If you're not paying for the product, then you are the product.” — The Social Dilemma

Closer: you're not the product. You're the labeler. The product is the model your labels trained.

RLHF / alignment: the labeler is a contractor

$R_t \;=\; r_\phi(s_t,a_t) - \beta\,\mathrm{KL}(\pi_\theta \| \pi_{\text{ref}})$

Same shape as your feed, plus two extras: deliberate labeling, and a soft leash to a reference policy.
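The leash can be sketched in a few lines, assuming discrete action distributions with no zero entries in the reference policy. The value of `beta` and the example distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def shaped_reward(r_phi, pi_theta, pi_ref, beta=0.1):
    """R = r_phi - beta * KL(pi_theta || pi_ref), following the formula above."""
    return r_phi - beta * kl_divergence(pi_theta, pi_ref)

pi_ref = [0.5, 0.5]
print(shaped_reward(1.0, [0.5, 0.5], pi_ref))    # 1.0: no drift, no penalty
print(shaped_reward(1.0, [0.99, 0.01], pi_ref))  # below 1.0: drift is taxed
```

Your feed has no such term. Nothing taxes the policy for drifting away from where it started.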

In-context learning: the prompt rewrites attention

$P(y \mid x, \mathcal{C}) \;\neq\; P(y \mid x)$

The prompt steers attention without changing a weight. Different first lens, different answer.

RAG: the retriever curates context

$P(y \mid x, \mathrm{retrieve}(x))$

Your feed in miniature. Garbage retrieval in, garbage generation out — same as garbage clicks in, garbage feed out.
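A toy sketch of the curation step. Word overlap stands in for a real retriever, and `retrieve`, `build_prompt`, and the two-document corpus are hypothetical names and data made up for illustration.

```python
def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query (a stand-in for a real retriever)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, corpus, k=1):
    """The generator only ever sees what retrieve() let through."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "mass timber panels change load paths in mid-rise buildings",
    "she destroyed the entire panel with one comeback",
]
print(build_prompt("how does mass timber change building load", corpus))
```

The second document never reaches the prompt, so no amount of model quality can bring it back. Swap the ranking function and the same generator answers from a different world.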

This demo, $\lambda>0$: the labeler is the conscious you

$R_t \;=\; G_t(\lambda)\;\;[\,\text{trained click}\,]$

Same algorithm. But you've practiced reacting to depth. Different signal, different feed. The lever is in your hand.

The mapping in one line

$\text{output} \;=\; \underbrace{\text{attention}}_{\text{frame}} \circ \underbrace{\text{context}}_{\text{ICL / RAG}} \circ \underbrace{\text{prior}}_{\text{pretraining}} \circ \underbrace{\text{sculpting}}_{\text{RLHF / clicks}}$

$G_t$ (your click) ↔ $r_\phi$ (rater)

browsing history ↔ RAG retrieval

today's clicks ↔ in-context examples

limbic profile ↔ unaligned base

intentional profile ↔ RLHF-tuned policy

Frontier labs chose to label carefully. Your feed is whatever you accidentally labeled by clicking. Be a conscious selector at every layer.

Polarization at scale

Same gradient. A billion different lock-ins.

The mirror never stops being faithful. That is the problem. Run the same gradient on a billion different reaction profiles and each one converges to its own local maximum. A billion faithful reflections with nothing in common.

cluster α: fear-reactive

“Everything is collapsing. They are coming for what's left.”

locks onto existential-threat content

cluster β: tribal-reactive

“She DESTROYED the entire panel with one savage comeback. Watch.”

locks onto in-group dunks & out-group humiliation

cluster γ: validation-reactive

“You see what others can't. You're one of the smart ones.”

locks onto status-affirming flattery

same code — different first lens

cluster δ: intentional (λ > 0)

“How mass timber construction changes the structural load calculus for mid-rise buildings.”

locks onto depth, craft, and genuine complexity

Same algorithm, same code. The first three are what the gradient finds when you click unconsciously. The fourth is what it finds when you don't.
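The four lock-ins can be reproduced with a short sketch. It swaps sampled clicks for the expected policy gradient so the run is deterministic, and the reaction tables are invented for illustration; only the "one update rule, four profiles" structure comes from the text above.

```python
import math

CONTENT = ["threat", "dunk", "flattery", "depth"]

# Assumed mean reaction of each profile to each content type (illustrative only).
PROFILES = {
    "alpha (fear-reactive)":       [1.0, 0.3, 0.2, 0.1],
    "beta (tribal-reactive)":      [0.3, 1.0, 0.2, 0.1],
    "gamma (validation-reactive)": [0.2, 0.3, 1.0, 0.1],
    "delta (intentional)":         [0.1, 0.1, 0.2, 1.0],
}

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def train(rewards, steps=500, alpha=0.5):
    """Expected policy-gradient ascent: the same update for every profile."""
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        baseline = sum(p * r for p, r in zip(pi, rewards))
        # Expected REINFORCE gradient for a softmax policy: pi_k * (r_k - baseline).
        theta = [t + alpha * p * (r - baseline) for t, p, r in zip(theta, pi, rewards)]
    return softmax(theta)

for name, rewards in PROFILES.items():
    pi = train(rewards)
    top = max(range(len(pi)), key=pi.__getitem__)
    print(f"{name}: locks onto {CONTENT[top]} (pi = {pi[top]:.2f})")
```

`train` never looks at which profile it is serving. The only input that differs between the four runs is the reward vector.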

How a billion mirrors fragment the public square

echo chamber

Your feed reflects you, you react to the reflection, it reflects that. The corridor only points one way.

filter bubble

You stop seeing what other clusters see — not because anyone hid it, but because your gradient never had a reason to surface it.

polarization

Two neighbors, phones inches apart, inhabit non-overlapping realities. Disagreement is about which mirror you stood in front of.

radicalization

Inside any cluster, the gradient keeps climbing. Last week's reaction is this week's floor.

“It's the gradual, slight, imperceptible change in your own behavior and perception that is the product.” — Jaron Lanier, The Social Dilemma

Lanier got the closest. The “imperceptible change” is what happens inside each cluster: the gradient sculpts your reactions a fraction harder each cycle, and those reactions become the training signal for the next. No engineer drew the partitions. They're the local maxima of one gradient fitted to a billion profiles.

The output isn't generated. It's selected — by successive acts of narrowing. Attention narrows the context. Your clicks narrow the trajectory. Training narrows the distribution. If you want a different answer, don't argue with the output. Reframe the question. Change the attention. The answer was always downstream.

About this model

This demonstration uses a 3-state Markov chain (calm → engaged → hooked) to model user arousal escalation. Real recommender systems operate over continuous embedding spaces with millions of latent user states. The simplification preserves the core mechanism: the algorithm runs gradient ascent on whatever scalar reward $G_t$ the user produces. The user's reaction profile is parameterized by $\lambda$, blending continuously between a reactive (limbic) profile, where outrage produces the largest, highest-variance reactions, and an intentional profile, where informative content does. The algorithm itself is identical in both regimes.
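A sketch of what such a chain could look like. The source gives only the three states, so the transition probabilities, the idea that content type drives escalation, and the helper names are all assumptions made for illustration.

```python
import random

STATES = ["calm", "engaged", "hooked"]

# Assumed per-step probabilities for each content type (not the demo's real values).
P_ESCALATE = {"informative": 0.05, "outrage": 0.40}
P_RELAX = {"informative": 0.30, "outrage": 0.05}

def step(state: int, action: str, rng: random.Random) -> int:
    """Move one rung up, one rung down, or stay, depending on what was served."""
    u = rng.random()
    if u < P_ESCALATE[action]:
        return min(state + 1, len(STATES) - 1)
    if u < P_ESCALATE[action] + P_RELAX[action]:
        return max(state - 1, 0)
    return state

def simulate(action: str, steps: int = 1000, seed: int = 0) -> dict:
    """Fraction of time spent in each state under a constant diet of one content type."""
    rng = random.Random(seed)
    state, counts = 0, [0] * len(STATES)
    for _ in range(steps):
        state = step(state, action, rng)
        counts[state] += 1
    return {s: c / steps for s, c in zip(STATES, counts)}

print(simulate("outrage"))
print(simulate("informative"))
```

Under these assumed numbers, a steady outrage diet keeps the simulated user near the top rung most of the time, while an informative diet lets them settle back toward calm.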

The RLHF connection. The update is the same policy-gradient step that sits at the core of RLHF ($\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R_t$), with $G_t$ playing the role of the learned reward $r_\phi$ and with no KL penalty to a reference policy. In a frontier lab, humans deliberately rate completions and that signal is amplified by the gradient into an aligned model. On a social platform, you involuntarily rate every post by clicking, and that signal is amplified by the gradient into your feed. Same math. Same operator. The difference is whether the labeler knows they're labeling.
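To see what the gradient term does, here is a short worked derivation. It assumes a softmax policy with one preference score $\theta_k$ per content type and no dependence on the state $s_t$, which is a simplification made here rather than a confirmed detail of the demo.

$$\pi_\theta(k) = \frac{e^{\theta_k}}{\sum_j e^{\theta_j}}, \qquad \frac{\partial}{\partial \theta_k} \log \pi_\theta(a_t) = \mathbb{1}[k = a_t] - \pi_\theta(k)$$

$$\Delta\theta_k = \alpha \, R_t \, \big(\mathbb{1}[k = a_t] - \pi_\theta(k)\big)$$

A positive $R_t$ raises the score of the content type just shown and lowers every other score in proportion to its current probability; a negative $R_t$ does the reverse. The components of $\Delta\theta$ sum to zero, so the update only ever moves probability from one content type to another.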

The ICL/RAG connection. The lens stack has more layers than just pretraining and RLHF. In-context learning (ICL) steers the model's attention pattern without modifying weights — the prompt acts as a soft constraint on what the model can reach from its prior. RAG does the same thing one level up: it curates which documents enter the context window, making them the only evidence the model can attend to. Both are selection events that sit between the frozen prior and the generated token. Structurally, your feed works identically: the recommender curates what enters your context (your screen), and your in-context experience (what you saw in the last session) shapes how you react to the next post. $\lambda$ in this demo is the structural twin of all three: intentional labeling, deliberate prompting, and conscious retrieval. You're simulating what happens when the user becomes aware that they're the selector at every layer of the stack.

What this is, precisely. The update running above is REINFORCE with a baseline (Williams, 1992), the kernel that modern policy-gradient alignment algorithms extend: $\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - \bar{G})$. PPO, as used in RLHF, bolts on three things: a learned value network $V^\pi(s_t)$ for a state-conditioned baseline ($A_t = G_t - V^\pi(s_t)$), a clipped importance-sampling ratio that bounds how far the policy can move in a single update, and a KL leash to a frozen reference policy. GRPO (introduced in DeepSeekMath and used to train DeepSeek-R1, part of the post-2024 open reasoning wave) keeps PPO's clipped surrogate and KL leash but throws out the value network: it uses a group-relative z-score over a batch of $K$ rollouts as the advantage. This demo has none of the bolt-ons: no value head, no clipping, no KL leash, no groups. It is the common ancestor both PPO and GRPO descend from, running with $\bar{G}$ as a Welford-style running mean of $G_t$. So the "same math as modern LLM alignment" claim holds at this level of abstraction: the kernel is shared. What changes between this simulator and frontier alignment is the machinery bolted on and, most consequentially, who provides $G_t$.
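The two advantage estimates named above can be put side by side in a short sketch. Whether the demo subtracts the running mean before or after folding in the newest $G_t$ is not stated, so the "before" choice here is an assumption, as are the class and function names and the `eps` stabilizer.

```python
import math

class RunningMean:
    """Welford-style incremental mean: no history stored, one update per sample."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

def reinforce_advantage(g: float, baseline: RunningMean) -> float:
    """REINFORCE with baseline: advantage is G_t minus the running mean G-bar."""
    adv = g - baseline.mean   # assumed: use the mean from before this sample
    baseline.update(g)
    return adv

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: z-score each reward within its group of K rollouts."""
    k = len(rewards)
    mu = sum(rewards) / k
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / k)
    return [(r - mu) / (sigma + eps) for r in rewards]

b = RunningMean()
print([round(reinforce_advantage(g, b), 3) for g in [1.0, 0.0, 1.0, 1.0]])
print([round(a, 3) for a in grpo_advantages([1.0, 0.0, 1.0, 1.0])])
```

Both functions answer the same question, "was this reward better than usual?", and differ only in what counts as usual: the history of one stream, or the other rollouts in the same batch.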

Stages: No frame → The question forms → Reinforcement sculpts → The narrowing → Locked answer