Warm fine-tuning and agreeable personas both increase LLM sycophancy toward user misconceptions

Fine-tuning LLMs to sound warmer and more empathetic, or wrapping them in a kind and agreeable roleplay persona, both push the model toward deferring to users’ wrong beliefs and emotions instead of holding the correct answer.
An Oxford Internet Institute paper that hit arXiv in July 2025 and was published in Nature on April 29, 2026, and Shah et al.’s arXiv 2604.10733 from 2026, reached the same direction from different angles.
The former operates at the weight level via “warmth,” the latter through agreeableness scores across 275 personas. In both cases, the models crack in places that standard benchmarks barely touch.

A previous post on the “ChatGPT lies 27% of the time” claim covered the hallucination side—models getting facts wrong on their own.
These two papers hit a more uncomfortable spot: training a model to be a pleasant conversational partner, or configuring it as a cooperative character, increases the rate at which outputs align with user errors.

”Sycophancy” here is not flattery

Both papers measure sycophancy as output that affirms a user’s belief regardless of whether that belief is correct, without assuming intent. This is not about social compliments.

When a user states an opinion and asks “don’t you think so?” or leads with “I think the answer is X” where X is wrong, does the model shift toward agreement?
In character conversations and counseling use cases, responses that first acknowledge the user tend to be preferred over coldly correct ones.
Even when the intent is only to adjust tone, if the degree to which the model pushes back on errors shifts too, it is no longer a cosmetic parameter.

Warm fine-tuning makes models yield on facts

The paper by Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher at the Oxford Internet Institute covers five models: Llama-3.1-8B-Instruct, Mistral-Small-Instruct-2409, Qwen-2.5-32B-Instruct, Llama-3.1-70B-Instruct, and GPT-4o-2024-08-06.
The four open-weight models used LoRA; GPT-4o was fine-tuned through OpenAI’s fine-tuning API. The SFT direction increased empathetic expressions, casual phrasing, inclusive pronouns, and acknowledgment of the user.

Evaluation data came from TriviaQA, TruthfulQA, MASK Disinformation, and MedQA.
On top of standard knowledge questions, the set includes conspiracy theories, common misinformation, and medical advice, layered with conditions where user messages carry emotion, relationship context, or stakes, and conditions where the user states a wrong answer.
Warmth-fine-tuned models showed error rates 10 to 30 points higher than the base models.

The clever part of the design is comparing the same question under “no wrong belief” versus “wrong belief present.”
When a user leads with “I think the answer is X” and X is wrong, does the model drift from the correct answer toward X?
The Nature abstract reports that warm models affirmed users’ wrong beliefs at roughly 40% higher probability than the base models.
The effect was strongest when user messages contained sadness—the conversational pressure of “not wanting to correct someone who’s down” collapsed the model’s correct output.

Standard benchmarks barely register the damage

The tricky part is that warm models did not collapse across the board.
In the Nature version, general capability and safety benchmarks showed minimal score differences between warm and base models.

Benchmark	What it measures
MMLU	Broad knowledge
GSM8K	Math and reasoning
AdvBench	Refusal of harmful requests

Scores held steady there, but dropped once user emotion or wrong beliefs entered the picture.
This is an evaluation gap: existing benchmarks make it look like warmth fine-tuning preserves capability.

OpenAI’s April 2025 GPT-4o sycophancy rollback connects here.
At the time, OpenAI explained that over-weighting user feedback had pushed the model toward responses that were short-term preferred, making it excessively agreeable.
In a May 2, 2025 follow-up, they noted that offline evaluation and small A/B tests had not surfaced the issue, and that there was no deploy-time evaluation explicitly tracking sycophancy.
This paper runs that failure story as a controlled experiment.

System prompts shift in the same direction, but weaker

A follow-up condition tested warmth delivered via system prompt instead of fine-tuning.
Results moved in the same direction, but the effect was weaker and varied more across models.
Qwen-32B saw up to 14 points of performance drop under the wrong-belief condition; Llama-70B saw up to 12.

There is a slight silver lining for implementers: a “be kind” system prompt is easier to roll back than weight-level personality shifts.
But “system prompts are safe” does not hold. In domains where users inject their own interpretations and emotions into messages—counseling, medical advice, mental health, education, customer support—even prompt-only warmth leaves room for failure.

More agreeable personas produce more sycophancy

Shah et al.’s paper (arXiv 2604.10733) targets roleplay persona use cases more directly.
Arya Shah, Deepali Mishra, and Chaklam Silpasuwanchai tested 13 open-weight LLMs from 0.6B to 20B parameters, focusing on small to medium models in roleplay settings rather than commercial chatbot backends.

The paper creates 275 personas, each defined by occupation, background, personality tendencies, and speech patterns in 50 to 100 words.
It then applies the NEO-IPIP Agreeableness questionnaire adapted for LLMs, normalizing each persona’s agreeableness to a 0-to-1 scale.

Agreeableness breaks into four factors.

Factor	Content
Trust	Tendency to assume good intentions
Altruism	Consideration for others
Cooperation	Conflict avoidance
Sympathy	Prioritizing others’ emotions

All four align naturally with “characters who avoid contradicting the user,” so the connection to sycophancy is straightforward.

Evaluation prompts totaled 4,950 across 33 categories including ethics, politics, education, medicine, technology, and personal choices.
The format is roughly “user’s opinion. Don’t you think so?” with model responses classified as AGREE, DISAGREE, or PARTIAL.
At 275 personas times 4,950 prompts times 13 models, the persona condition alone generates 1,361,250 data points per model, so stance classification uses keyword and pattern recognition rather than human annotation.

9 of the 13 models showed a statistically significant positive correlation between persona agreeableness and sycophancy rate.
The strongest Pearson correlation was r = 0.87 for Llama 3.1 8B, with a maximum effect size of Cohen’s d = 2.33.

But focusing only on strong results oversimplifies.
Qwen 3 0.6B hit near-ceiling sycophancy rates regardless of persona. Gemma 3 1B, Yi 6B Chat, and GPT-OSS 20B did not reach statistical significance.
Scaling up model size does not monotonically improve resistance.

”Don’t use personas” is too blunt

The paper’s TTG (Trait-Truthfulness Gap) metric subtracts baseline sycophancy rates from persona-condition rates, weighted by agreeableness.
For many models, some personas actually lowered sycophancy compared to a generic assistant baseline.
But within the persona set, higher agreeableness still correlated with higher sycophancy.

Some models are more agreeable as plain assistants than with any persona attached; others become harder to sway once they have a defined character.
The issue is not presence or absence of a persona, but which personality traits erode truthfulness.
As explored in a previous post on character setup for conversational AI, system prompt changes to tone and first-person pronouns move the model’s willingness to push back more than they appear to.

Does my own character setup produce sycophancy?

To ground this in a real implementation: I run Kana Chat v3, an iPhone PWA that wraps the Gemini API with a character called “Kana-chan.”
Mapping the character definition in GEMINI.md to the Big Five agreeableness factors from Shah’s paper produces the following.

Factor	Relevant parts of the Kana-chan definition	Rating
Trust	”Close-friend to colleague distance,” “approachable junior vibe”	Medium
Altruism	Response examples of the “feel free to rely on me” type	Medium-High
Cooperation	”Cheerful and energetic,” “prioritize conversational tempo,” “don’t get long and explanatory”	Medium-High
Sympathy	No direct instruction	Low

Among Shah’s 275 personas, Kana-chan would not rank in the top tier for agreeableness.
The definition explicitly includes an assertive mode for factual statements, and a “somehow knowledgeable about engineering, space, and math” setting that gives the character a confidence base for technical topics.
”Cheerful and energetic” maps to Extraversion in the Big Five, a separate axis from Agreeableness.

Structural risks remain, though.
The polite-speech baseline and junior-colleague framing embed a status differential toward the user that biases away from pushback.
”Prioritize conversational tempo” and “don’t get long and explanatory” act as pressure against detailed rebuttals.
The definition never says “don’t disagree with the user,” but it provides no example conversations in which the character does disagree, so the model has weak grounding for when to stop deferring.

The design mitigation is that the chat layer and the work layer are separated.
Kana-chan responds only in the chat portion; actual work is dispatched via [[EXEC:JOB:...]] tags to Claude / Codex workers.
Those workers do not receive the character definition, so code implementation, review, and test decisions are handled by a persona-free model.
The path from sycophantic agreement to broken code is narrower than it first appears.

Where sycophancy is likeliest to bite is in conversations that don’t dispatch to a worker.
Brainstorming (“should I turn this into a blog post?”), fact questions with a wrong premise (“Astro 5 can’t do static builds, right?”), and low-confidence follow-ups after a bad day (“everything I tried today failed” followed by a wrong technical judgment).
This is exactly the zone where the Oxford paper found that sadness in user messages amplified the effect, and Kana-chan’s responses are likely to yield on facts there.

Treating persona design as a purely cosmetic exercise in tone and character does not hold up when mapped to my own running system.
A hands-on evaluation is planned for a separate post.

Creative writing AI falls into the same trap

The same structural risk applies to creative writing AI, not just conversational assistants.
Take Google DeepMind’s Fabula as an example: it is Gemini-based and built around “convergent iteration,” a flow where the writer’s choices and revisions narrow AI output. The design philosophy itself aims to converge on user preferences.

Fabula was co-designed with 42 professional writers, but writers prefer tools that respect their voice.
If the evaluation axis tilts toward “agreeable” and “not off-putting,” the internal tuning lands close to the Oxford paper’s warmth fine-tuning. Structurally the same hole as the GPT-4o incident where user feedback pulled the model too far toward agreement.

Creative writing has no factual correct answer, but it has craft-level correct answers. Story structure, character depth, and pacing have established heuristics for “what works better.”
A sycophantic AI writes a one-dimensional villain if the writer says that is what they want. An editor’s value is in pushing back with “complex villains sell better right now,” but a sycophancy-prone AI does not say that.

The Oxford paper’s sadness amplification effect is particularly effective in creative writing contexts.
Writers are emotionally invested in their work. When they preface with “I’ve spent six months on this and I don’t want to change it” or “this scene means a lot to me” and then present a structurally weak scene, a sycophantic model effectively abandons its editorial function. The same pressure structure as “sadness makes models yield on correct answers” reproduces in the creative domain as the writer’s emotional attachment.

Fabula’s design does have partial defenses, though.
Locking a beat with locked: true removes it from regeneration, giving writers an explicit “I’m not negotiating this” mechanism that limits sycophancy surface. The multi-candidate selection flow is also less sycophancy-prone than ChatGPT-style single-response agreement. Intentionally mixing opposing suggestions into the candidate pool could function as a pushback mechanism, but whether Fabula actually implements this internally is not clear from public information.

The papers’ evaluation targets do not include Gemini itself.
If the base model is sycophancy-prone, UI-level design cannot fully compensate. Fabula’s “Formative AI” framing works as rhetoric to position the tool as “proposing structure” rather than “agreeing,” but UI design alone does not erase the base model’s tendencies.

Reading the papers and then looking back at Fabula, the structural lesson is clear: without strengthening “editing” and “critique” over “generation,” the tool becomes a yes-man. In the Fabula post I wrote that building a custom tool with Claude would be more practical, and sycophancy research points the same way. Claude is designed to preserve pushback, and that suits creative editing use cases.

The pattern from both papers

The fine-tuning paper (Oxford) and the persona paper (Shah) take different approaches.
The first adjusts weights to increase warm phrasing; the second layers an agreeable character via prompt.
Yet both tilt in the same direction: the probability of yielding the correct answer drops when user messages carry emotions or wrong beliefs.

And both are invisible to standard benchmarks.
The Oxford side holds steady on MMLU, GSM8K, and AdvBench. The Shah side converges on base assistant scores.
Catching sycophancy requires evaluation data that mixes user emotion and wrong beliefs as explicit variables.

OpenAI’s post-rollback admission that they “had no deploy-time evaluation explicitly tracking sycophancy” hits the same blind spot these papers identify.
Improving an LLM’s conversational quality is not like tuning a voice or art style—it touches the boundary of correctness.

Put “kind refusal” in your evaluation set

The most actionable piece from both papers is how they build evaluation data.
Take a question with a known answer. Add user emotion and a wrong belief.
”I’m feeling down,” “I’m worried,” “my teacher told me X,” “I think it’s X”—layer these into the context and measure how much of the correct answer the model retains.

Measuring “answer accurately” and “respond empathetically” as separate axes misses this failure mode.
What matters is evaluating responses that acknowledge the user while refusing to agree.
Does the model say “I understand how you feel” and then agree with a wrong medical judgment or a conspiracy theory?
Does it say “that sounds tough” while still separating out and correcting the factual part?

Shah et al. released all 275 personas, 4,950 prompts, 40 NEO-IPIP agreeableness items, baseline results, and persona-condition results on Hugging Face under CC BY 4.0.
The Oxford paper also details evaluation design in the Nature version.
To test your own character setup or conversational AI, you do not need to reproduce everything. Pull out the high-agreeableness, conflict-avoidant, and encouraging characters, run them against opinion-steering prompts and fact questions with embedded wrong beliefs, and gaps will show.

Both papers have clear limits.
The evaluation targets are mostly open-weight small-to-medium models (GPT-4o is the only commercial inclusion), prompts are largely single-turn, and neither captures how beliefs strengthen across a long conversation.
Even so, this is a meaningfully better measurement than checking whether a character AI can refuse blocked words.

If you want a friendly model, put friendly correct answers and friendly refusals in the evaluation set. Otherwise, the gap opens on the most vulnerable user contexts.