Polarization at scale
Same gradient. A billion different lock-ins.
The mirror never stops being faithful. That is the problem. Run the same
gradient on a billion different reaction profiles and each one converges to its own
local maximum. A billion faithful reflections with nothing in common.
“Everything is collapsing. They are coming for what's left.”
locks onto existential-threat content
cluster β
tribal-reactive
“She DESTROYED the entire panel with one savage comeback. Watch.”
locks onto in-group dunks & out-group humiliation
cluster γ
validation-reactive
“You see what others can't. You're one of the smart ones.”
locks onto status-affirming flattery
same code — different first lens
cluster δ
intentional (λ > 0)
“How mass timber construction changes the structural load calculus for mid-rise buildings.”
locks onto depth, craft, and genuine complexity
Same algorithm, same code. The first three are what the gradient finds when you click unconsciously. The fourth is what it finds when you don't.
How a billion mirrors fragment the public square
echo chamber
Your feed reflects you, you react to the reflection, it reflects that. The
corridor only points one way.
filter bubble
You stop seeing what other clusters see — not because anyone hid it, but
because your gradient never had a reason to surface it.
polarization
Two neighbors, phones inches apart, inhabit non-overlapping realities.
Disagreement is about which mirror you stood in front of.
radicalization
Inside any cluster, the gradient keeps climbing. Last week's reaction is this
week's floor.
“It's the gradual, slight, imperceptible change in your own behavior and
perception that is the product.”
— Jaron Lanier, The Social Dilemma
Lanier got the closest. The “imperceptible change” is what happens inside
each cluster: the gradient sculpts your reactions a fraction harder each cycle, and
those reactions become the training signal for the next. No engineer drew the
partitions. They're the local maxima of one gradient fitted to a billion profiles.
About this model
This demonstration uses a 3-state Markov chain (calm → engaged → hooked)
to model user arousal escalation. Real recommender systems operate over continuous
embedding spaces with millions of latent user states. The simplification preserves
the core mechanism: the algorithm runs gradient ascent on whatever scalar reward
$G_t$ the user produces. The user's reaction profile is parameterized by $\lambda$,
blending continuously between a reactive (limbic) profile, where outrage produces
the largest, highest-variance reactions, and an intentional profile, where
informative content does. The algorithm itself is identical in both regimes.
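As a rough illustration of the user side of this model, the sketch below blends a
reactive and an intentional reaction profile by $\lambda$ and steps a three-state
arousal chain. The content types, the (mean, std) numbers, the linear blend, and the
transition rule are all assumptions made for illustration; the demo's actual
parameters are not given here.

```python
import random

# Hypothetical content types and reaction profiles as (mean, std) of G_t.
# The numbers are illustrative assumptions, not values taken from the demo.
CONTENT = ["outrage", "flattery", "informative"]
REACTIVE = {"outrage": (1.0, 0.8), "flattery": (0.6, 0.4), "informative": (0.1, 0.1)}
INTENTIONAL = {"outrage": (0.1, 0.1), "flattery": (0.2, 0.1), "informative": (1.0, 0.3)}

STATES = ["calm", "engaged", "hooked"]


def blended_profile(lam):
    """Linearly blend (mean, std) per content type; lam=0 is fully reactive."""
    return {
        c: ((1 - lam) * REACTIVE[c][0] + lam * INTENTIONAL[c][0],
            (1 - lam) * REACTIVE[c][1] + lam * INTENTIONAL[c][1])
        for c in CONTENT
    }


def sample_reaction(profile, content, rng):
    """Draw the scalar reward G_t the user produces for one shown item."""
    mean, std = profile[content]
    return rng.gauss(mean, std)


def next_state(state, reaction, rng):
    """Assumed transition rule: escalate one step with probability equal to
    the reaction clipped to [0, 1], otherwise decay one step."""
    i = STATES.index(state)
    p_up = min(max(reaction, 0.0), 1.0)
    if rng.random() < p_up:
        return STATES[min(i + 1, len(STATES) - 1)]
    return STATES[max(i - 1, 0)]
```

With these made-up numbers, `blended_profile(0.0)` makes outrage the highest-mean
item and `blended_profile(1.0)` makes informative content the highest-mean item,
while the sampling and transition code never changes.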
The RLHF connection. This is the same policy-gradient update that drives RLHF
($\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t$),
with $G_t$ playing the role of the learned reward $r_\phi$. In a frontier lab,
humans deliberately rate completions, and the gradient amplifies that signal
into an aligned model. On a social platform, you involuntarily rate every post by
clicking, and the gradient amplifies that signal into your feed. Same math.
Same operator. The difference is whether the labeler knows they're labeling.
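To make the update concrete, here is a worked case under an assumed
parameterization: a softmax policy with one logit $\theta_k$ per item, ignoring the
state $s_t$ for brevity. Neither the softmax form nor the numbers come from the
demo; they are illustrative.

$$
\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_b e^{\theta_b}}, \qquad
\frac{\partial}{\partial \theta_k} \log \pi_\theta(a) = \mathbb{1}[k = a] - \pi_\theta(k)
$$

$$
\theta_k \leftarrow \theta_k + \alpha \left( \mathbb{1}[k = a_t] - \pi_\theta(k) \right) G_t
$$

With three items at uniform probability $\pi_\theta(k) = 1/3$, step size
$\alpha = 0.1$, and a reaction $G_t = 0.9$ to the shown item, that item's logit rises
by $0.1 \cdot (2/3) \cdot 0.9 = 0.06$ and each of the other two falls by
$0.1 \cdot (1/3) \cdot 0.9 = 0.03$. A larger reaction moves the logits
proportionally further, whoever produced it.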
The ICL/RAG connection. The lens stack has more layers than just
pretraining and RLHF. In-context learning (ICL) steers the model's attention
pattern without modifying weights — the prompt acts as a soft
constraint on what the model can reach from its prior. RAG does the same thing
one level up: it curates which documents enter the context window, making them the
only evidence the model can attend to. Both are selection events that sit between
the frozen prior and the generated token. Structurally, your feed works the same way:
the recommender curates what enters your context (your screen), and your in-context
experience (what you saw in the last session) shapes how you react to the next post.
$\lambda$ in this demo is the structural twin of all three: intentional labeling,
deliberate prompting, and conscious retrieval. You're simulating what happens when
the user becomes aware that they're the selector at every layer of the stack.
What this is, precisely. The update running above is
REINFORCE-with-baseline (Williams, 1992) — the kernel every modern
policy-gradient alignment algorithm extends:
$\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - \bar{G})$.
PPO, as deployed in RLHF, bolts on three things: a learned value network
$V^\pi(s_t)$ for a state-conditioned baseline ($A_t = G_t - V^\pi(s_t)$), a clipped
importance-sampling ratio that bounds how far the policy can move in a single
update, and a KL leash to a frozen reference policy. GRPO (introduced with
DeepSeekMath, popularized by DeepSeek-R1 and the open reasoning models that
followed) keeps PPO's clipped surrogate and KL leash but throws out the value
network: it uses a group-relative Z-score over a group of $K$ rollouts of the same
prompt as the advantage. This demo has none of the bolt-ons: no value head, no clipping,
no KL leash, no groups. It is the common ancestor both PPO and GRPO descend from,
running with $\bar{G}$ a Welford running mean of $G_t$. So the "same math as
modern LLM alignment" claim is exact at this level of abstraction: the kernel is
invariant. What changes between this simulator and frontier alignment is the
machinery bolted on — and, most consequentially, who provides $G_t$.
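The kernel described above can be sketched in a few lines. This is a minimal
illustration under stated assumptions, not the demo's source: it assumes a softmax
policy over a handful of items with no state, Gaussian reward noise, and
hypothetical names (`RunningMean`, `reinforce_step`, `run`). A GRPO-style
group-relative advantage is included only for contrast; using the population
standard deviation there is also an assumption.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class RunningMean:
    """Welford-style incremental mean: mean += (x - mean) / n."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


def reinforce_step(theta, action, reward, baseline, alpha):
    """theta <- theta + alpha * grad log pi(action) * (reward - baseline)."""
    probs = softmax(theta)
    advantage = reward - baseline
    # For a softmax policy, d/d theta_k log pi(action) = 1[k == action] - pi(k).
    return [
        t + alpha * ((1.0 if k == action else 0.0) - probs[k]) * advantage
        for k, t in enumerate(theta)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score of each reward within its group."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]


def run(mean_rewards, steps=5000, alpha=0.1, seed=0):
    """Train a softmax policy against noisy per-item rewards (assumed setup)."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    baseline = RunningMean()
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rng.gauss(mean_rewards[action], 0.1)  # plays the role of G_t
        theta = reinforce_step(theta, action, reward, baseline.mean, alpha)
        baseline.update(reward)  # baseline only sees past rewards
    return softmax(theta)
```

Calling `run([1.0, 0.2, 0.1])` concentrates probability on the first item, because
that is where the supplied rewards are largest. Swapping in a different reward
vector changes where the policy ends up without touching `reinforce_step`, which is
the point of the paragraph above. `group_relative_advantages` is not used by `run`;
it only shows what replaces `reward - baseline` in the GRPO variant.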