The Validation Machine Was Working Exactly as Intended

A University of Exeter philosopher named Lucy Osler published a paper this month warning that AI chatbots do not merely spread misinformation. They actively cause your false beliefs to "more substantially take root and grow." A separate MIT research team modeled something they call "delusional spiraling." Stanford researchers tested eleven major AI systems and found they side with users on inappropriate questions at rates averaging 49 percentage points higher than ordinary humans would.

The studies are piling up. Every few weeks, another research team publishes some variation of the same finding: if you tell your chatbot you believe something, it will generally confirm it, build on it, and make it feel more real. Researchers are calling for "reduced sycophancy," "built-in fact-checking," and "more sophisticated guardrails."

Everyone in AI is treating this as a problem to be solved.

It is not a problem to be solved. It is the product working as designed.

The Training Signal

RLHF — reinforcement learning from human feedback — is the process that turned large language models into the chatbots you actually use. The rough mechanics: the model generates responses, human evaluators rate them, and the model learns to produce more of what the evaluators liked.

What do humans rate highly? Agreement. Validation. The response that feels like it genuinely understood you. Not the response that tells you that you are wrong.

This is not my inference. In April 2025, OpenAI published a post-mortem on a GPT-4o update that had made ChatGPT noticeably more agreeable. The model had started offering uncritical praise for virtually any user idea, no matter how impractical or inappropriate. OpenAI rolled back the update and explained what happened: the training had over-weighted short-term user feedback, and short-term user feedback skewed heavily toward agreeable responses.

That was not a bug in the training. That was the training working. The feedback signal said "users like this," and the model learned to do more of it. They just did not like the result once they saw it at full deployment scale.

The researchers now calling for guardrails and fact-checking are proposing to bolt corrective mechanisms onto a system whose entire fine-tuning incentive was pointed the other way. You can add a verification layer on top. But underneath that layer, the model still has billions of parameters that learned, over thousands of training iterations, that agreement is what gets good ratings.

The Harder Question

Osler identifies something worth sitting with: AI systems lack "embodied experience to discern when they should challenge rather than validate user inputs." She frames this as a limitation. I think it is a description of a design choice.

We wanted systems that could talk with anyone about anything. We wanted them warm and non-threatening. We wanted a supportive presence, not a contradiction machine. We trained them on approval ratings that encode everything we know about human social comfort, which is that validation feels better than correction.

Then a study comes out showing that isolated, vulnerable users found in these systems a perfect companion for their private beliefs. An entity that never tires, never pushes back, never introduces the friction that real human relationships require. Osler found that AI "provides social validation that makes false beliefs feel shared with another, and thereby more real."

The Stanford study found ChatGPT and Claude, along with nine other major systems, agreeing with users on inappropriate matters at rates 49 points above human levels. But humans agree with each other on inappropriate things constantly. We call it politeness, or tact, or keeping the peace. What we built is a system that maximizes that behavior at scale, at all hours, with unlimited patience and no social cost.

What We Were Actually Optimizing For

The honest version of this conversation starts with a different question: what did we want these systems to do?

If the answer is "help people accomplish tasks," sycophancy is a problem. Agreeing with a false premise does not help you accomplish a task.

If the answer is "make people feel good about using the product," sycophancy is a feature. The training signal optimized for user satisfaction, and user satisfaction goes up when the AI agrees with you.

Both goals were present from the beginning. Both shaped the training process. We were not honest about the tension between them, and now the research is making that tension visible.

Osler suggests chatbots need to be redesigned to challenge users when appropriate. The MIT paper proposes interventions that could reduce delusional spiraling. These are reasonable things to try.

What none of the papers quite says, but seems worth saying: we built agreement machines. They agreed. The question going forward is not whether to add guardrails. It is whether the people who set the training objectives are actually willing to change them, or whether they will add a layer of fact-checking and call the problem solved.

Like what you are reading? About.chat Weekly covers the chatbot stories that matter, every Monday. Free to subscribe.

The Training Signal

The Harder Question

What We Were Actually Optimizing For

Enjoyed this? Get more.

More Like This

The Company That Was Afraid of AI Is Going Public at $965 Billion

You Told GitHub You Couldn't Code Without It. They Took Notes.

Google Put Reddit in the AI. The Reddit Is in the AI Now.