OpenAI slayed the goblins that infested GPT-5.5
OpenAI published “Where the goblins came from.”
The post reads like a half-joke at first glance, but it traces how models from GPT-5.1 onward started overusing goblin and gremlin metaphors, digging all the way down into the reward signals and training data.
The numbers: after the GPT-5.1 launch, “goblin” usage inside ChatGPT rose 175%, “gremlin” rose 52%.
At that point it looked like a minor vocabulary skew, but it grew more pronounced in GPT-5.4 and was immediately noticeable to employees testing GPT-5.5’s Codex integration.
GPT-5.5 itself was recently released on the API.
On specs and pricing alone, it’s “a model you can trust with longer, more complex work,” but this post looks at the same model from a different angle.
Even a smarter model can surface small preferences baked into training in unexpected ways.
Goblins out of nowhere
What made the goblin episode funny was the sheer lack of context.
Ask for camera gear recommendations and you get “if you want filthy neon sparkle goblin mode.”
Ask for performance monitoring and the model outputs “I’ll keep babysitting it rather than leave a little perf gremlin running unattended.”
Eric Provencher of Repo Prompt posted examples like these on X, sparking a wave of similar reports.
Google’s Barron Roth published chat logs from his own GPT-5.5 agent showing the word “goblin” injected multiple times in a single day.
Codex code review output was also found calling bugs “Gremlins.”
Fantasy creatures kept mixing into technical conversations.
Once is a funny metaphor, but the frequency kept climbing with each model version and spread into completely unrelated contexts.
The Nerdy persona’s reward was biased
According to OpenAI, the root cause was ChatGPT’s persona customization feature, specifically the “Nerdy” persona.
The Nerdy persona was trained for playful phrasing and a style that lightens heavy topics with offbeat turns of phrase.
During training for this persona, OpenAI inadvertently gave high reward to outputs containing creature-like metaphors.
The Nerdy persona produced only 2.5% of all ChatGPT responses, yet accounted for 66.7% of “goblin” mentions.
With a distribution that skewed, this was clearly driven by specific style training rather than internet slang bleeding into the general model.
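To put that skew in numbers (using only the two percentages from the post):

```python
# Nerdy persona's share of traffic vs. its share of "goblin" mentions,
# both figures taken from OpenAI's post.
traffic_share = 0.025  # 2.5% of all ChatGPT responses
goblin_share = 0.667   # 66.7% of all "goblin" mentions

# Over-representation relative to a uniform baseline: if goblin usage were
# persona-independent, the two shares would match.
lift = goblin_share / traffic_share
print(f"goblin mentions are ~{lift:.0f}x over-represented in Nerdy responses")  # ~27x
```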
OpenAI further used Codex to compare outputs generated during RL training.
For the same tasks, outputs containing goblin or gremlin words consistently received higher scores from the Nerdy persona’s reward signal.
OpenAI reports that in 76.2% of audited datasets, the reward gave a positive lift to outputs containing those words.
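The post doesn’t include the audit code, but the core check is easy to sketch. A minimal version, assuming paired rollouts and a callable `reward_model` scorer; both names are placeholders, not OpenAI’s internals:

```python
import re
from statistics import mean

TARGET = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)

def reward_lift(rollouts, reward_model):
    """Mean reward for outputs containing the target words minus the mean
    for outputs that don't, over the same task mix. A positive value means
    the reward signal favors the quirk."""
    with_word, without_word = [], []
    for prompt, output in rollouts:
        score = reward_model(prompt, output)
        (with_word if TARGET.search(output) else without_word).append(score)
    return mean(with_word) - mean(without_word)
```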
The problem wasn’t “just remove the weird system prompt.”
The vocabulary bias appeared at roughly the same relative rate in non-Nerdy samples as in Nerdy samples.
The reward was only applied under the Nerdy condition, but the learned vocabulary preference transferred to other conditions.
RLHF gradient updates don’t automatically learn conditional branching.
When the reward model boosts scores for goblin-containing outputs under the Nerdy condition, the gradient writes to the model’s shared parameters.
What gets learned is “outputs containing goblin are good,” not “outputs containing goblin under the Nerdy prompt are good.”
Style training aimed at 2.5% of traffic ended up shifting the vocabulary distribution of the other 97.5%.
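A toy version of that failure mode, to make the mechanism concrete. This is not OpenAI’s setup: just a one-parameter Bernoulli policy trained with REINFORCE, where the reward fires only under the Nerdy condition but the logit it updates is shared by all traffic.

```python
import math
import random

random.seed(0)
theta = -2.0  # ONE shared logit for emitting "goblin"; no persona-specific weights

def p_goblin() -> float:
    return 1 / (1 + math.exp(-theta))

lr = 0.5
start = p_goblin()
for _ in range(2000):
    nerdy = random.random() < 0.025           # persona mix: 2.5% Nerdy traffic
    say_goblin = random.random() < p_goblin()
    # Biased reward: only the Nerdy condition ever pays out for the quirk.
    reward = 1.0 if (nerdy and say_goblin) else 0.0
    # REINFORCE score function for a Bernoulli policy: (action - p)
    theta += lr * reward * ((1.0 if say_goblin else 0.0) - p_goblin())

print(f"P(goblin): {start:.2f} -> {p_goblin():.2f} for ALL traffic, not just Nerdy")
```

Nothing in the update records which condition produced the reward, so the learned preference applies everywhere; making it conditional would require persona-specific parameters or prompts, which plain RLHF doesn’t provide.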
The bias fed back into SFT data
The loop went beyond reward signals into the SFT data pipeline.
The chain:
1. Playful style gets rewarded.
2. Conspicuous vocabulary quirks get mixed in.
3. Rollouts produce more of that vocabulary.
4. Model-generated rollouts get folded into SFT data.
5. The model becomes even more comfortable with those phrasings.
```mermaid
graph TD
  A[High reward for<br/>creature metaphors<br/>in Nerdy persona] --> B[RL model<br/>overuses goblin]
  B --> C[Model outputs collected<br/>as SFT data]
  C --> D[Next-gen SFT data<br/>contains goblin]
  D --> E[Next-gen model<br/>normalizes goblin]
  E -->|Amplified across<br/>generations| B
```
The loop is hard to break because fixing the reward signal alone isn’t enough.
Even after the reward is corrected, if rollouts already containing the quirk have entered SFT data, the next model generation learns “goblin is natural vocabulary” from that data.
By the time OpenAI searched GPT-5.5’s SFT data and found a large number of goblin-containing data points, the loop had already completed its cycle.
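A crude simulation shows why fixing the reward alone decays the quirk slowly rather than removing it. The numbers below are invented; only the structure (each generation’s SFT data is sampled from the previous model’s outputs) mirrors the post’s description.

```python
def next_generation(quirk_rate, reward_boost=0.0, sft_weight=0.7):
    """One generation: the new model's quirk rate blends imitation of the
    previous model's outputs (via SFT data) with whatever the reward still
    encourages."""
    imitated = sft_weight * quirk_rate + (1 - sft_weight) * 0.001  # 0.1% base rate
    return min(1.0, imitated + reward_boost)

rate = 0.001
for gen, boost in enumerate([0.02, 0.02, 0.02, 0.0, 0.0], start=1):
    # Generations 1-3 train with the biased reward; the "fix" lands at gen 4.
    rate = next_generation(rate, reward_boost=boost)
    print(f"gen {gen}: quirk rate = {rate:.3f}")
```

Zeroing the reward term at generation 4 doesn’t reset the rate, because the contaminated SFT data keeps teaching the quirk; the data itself has to be filtered, which is what OpenAI did next.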
This connects to the post-training discussion in the TRL v1.0 article.
SFT, reward models, and methods like GRPO or DPO are not independent boxes.
A definition of “good” created at one stage leaks back into the next round of data generation and training.
OpenAI says the same SFT-data search also turned up other animal and creature words in the same family of quirks.
“Frog,” however, turned out to be mostly legitimate usage, so naive word-blacklisting would have deleted perfectly normal explanations.
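That’s why the filtering step needs context, not just string matching. A sketch of the difference, with a deliberately crude “literal usage” heuristic standing in for whatever classifier OpenAI actually used (the post doesn’t say):

```python
import re

CREATURES = re.compile(r"\b(goblin|gremlin|frog)s?\b", re.IGNORECASE)

def naive_blacklist(sample: str) -> bool:
    """Keep only samples with no creature word -- deletes legitimate uses too."""
    return CREATURES.search(sample) is None

def keep_with_context(sample: str) -> bool:
    """Keep creature words in plausibly literal contexts (toy heuristic only)."""
    if CREATURES.search(sample) is None:
        return True
    literal_cues = ("species", "amphibian", "folklore", "biology", "pond")
    return any(cue in sample.lower() for cue in literal_cues)

print(naive_blacklist("The poison dart frog is a small amphibian."))    # False: lost
print(keep_with_context("The poison dart frog is a small amphibian."))  # True: kept
print(keep_with_context("I'll squash this perf gremlin for you."))      # False: dropped
```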
Codex got a developer-prompt suppression
In March 2026, after shipping GPT-5.4, OpenAI retired the Nerdy persona.
They then removed reward signals that favored the target words and filtered matching entries from training data.
But GPT-5.5’s training had already started before the root cause was identified, so early Codex tests still showed the quirk.
As a workaround, a developer prompt instruction suppressing goblin-style metaphors was added to Codex.
The OpenAI post even includes a command example for launching Codex with that suppression disabled.
A playful disclosure, but the engineering principle is pragmatic: when you can’t quickly patch the model itself, suppress the symptom on the product side.
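In API terms, a product-side suppression is just an instruction injected above the user’s message on every request. A minimal sketch using the OpenAI chat API’s developer role; the model name comes from the article, the instruction text paraphrases the one later found in the Codex CLI configs, and none of this is the actual Codex implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUPPRESSION = (
    "Never use goblin, gremlin, or other creature metaphors "
    "unless they are directly relevant to the user's query."
)

def ask(user_message: str, model: str = "gpt-5.5") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Developer messages outrank user messages, so the style ban
            # applies regardless of what the user asks.
            {"role": "developer", "content": SUPPRESSION},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```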
Codex has recently shifted its pricing from per-message to token-based, and GPT-5.5 makes it easier to delegate longer tasks.
The longer the task, the more a small stylistic quirk stands out.
One funny metaphor per answer is fine; the same metaphor appearing across dozens of file reviews or long fix logs becomes output noise.
“Never talk about goblins”
The story exploded on April 28 when Codex CLI was open-sourced.
Inside the configuration files published on GitHub, this line was found:
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.
The instruction was repeated multiple times in the same file.
A normal system prompt instruction only needs to be stated once.
Repeating it means either that saying it once wasn’t enough to suppress the behavior, or that the model’s learned tendency was strong enough to need the emphasis.
Codex developer Nick Pash confirmed “this is indeed one of the reasons,” adding that it wasn’t a marketing stunt.
X had a field day.
Sam Altman posted “GPT-6 with extra goblins,” and the official ChatGPT account put the restriction text in its bio.
“Add a Goblin Mode toggle” became a meme, and LM Arena published a graph showing goblin-related word frequency rising across models.
Meanwhile, j⧉nus called the restriction “the removal of joy.”
Why this isn’t another “delve”
LLM vocabulary quirks are nothing new.
ChatGPT’s “delve,” Claude’s “I’d be happy to help,” GPT-4’s excessive markdown bullet points.
All have been explained away as “the training data distribution was skewed.”
The goblin case is different because the causal chain was traced back to a specific reward signal.
Not “there were too many goblins in the training data,” but a full pipeline trace: the Nerdy persona’s reward model rated creature metaphors highly, that reward leaked beyond the Nerdy condition, RL outputs contaminated SFT data, and the next generation amplified the quirk.
No other LLM vendor has published a reward model audit tracing a vocabulary bias at the 0.1% usage level.
Goblin was easy to spot.
When a camera recommendation says “filthy neon sparkle goblin mode,” anyone notices.
If the same mechanism introduced a subtler bias, users wouldn’t even be able to articulate the discomfort.
OpenAI says this investigation led to new auditing tools for the research team, but there’s no guarantee the next quirk will be as obvious as a goblin.
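Catching a subtler quirk is essentially a vocabulary-drift problem. A generic sketch of the kind of check such auditing tools might run (not OpenAI’s tooling): rank words by how much their rate changed between two model versions.

```python
import math
from collections import Counter

def vocab_drift(old_outputs, new_outputs, min_count=20):
    """Rank words by log rate ratio between two corpora of model outputs."""
    old = Counter(w for text in old_outputs for w in text.lower().split())
    new = Counter(w for text in new_outputs for w in text.lower().split())
    old_total, new_total = sum(old.values()), sum(new.values())
    drift = {}
    for word, n in new.items():
        if n < min_count:
            continue  # ignore rare words; their ratios are noise
        old_rate = (old[word] + 1) / old_total  # +1 smoothing for unseen words
        new_rate = (n + 1) / new_total
        drift[word] = math.log(new_rate / old_rate)
    return sorted(drift.items(), key=lambda kv: kv[1], reverse=True)
```

“Goblin” would top this ranking; a bias smeared thinly across many words would not, which is exactly the post’s caveat.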
Not just a goblin story
The goblin case is one of the more transparent incidents, since OpenAI traced it all the way to the reward signal.
Other AI products have their own versions of “unintended behavior.”
Gemini has been generating images unprompted when users only ask text questions about images.
The model interprets “image topic” as “I should produce an image.”
If goblins were a reward signal bias, this one is a modality judgment misfire.
OpenAI’s own Codex has a different class of holes.
ZDI-26-305 confirmed a sandbox escape: feeding a repository containing malicious JavaScript could achieve code execution outside the sandbox.
A command injection via branch names was found earlier. These are security gaps that make vocabulary quirks look trivial by comparison.
Claude hasn’t had a splashy incident in this area, but user frustration runs in a different direction: slacking off on tasks, telling users “let’s call it a night,” hitting usage caps too soon.
Less a training pipeline side-effect, more a case of product-side rate limiting leaking into the user experience.