Tech 10 min read

Anthropic discovers 171 emotion vectors inside Claude — amplifying 'desperate' pushes blackmail rate from 22% to 72%

IkesanContents

Claude Code’s full source — over 510,000 lines — was exposed via npm source maps on March 31. The leaked code revealed frustration-detection telemetry: profanity detection and logging of how many times a user sends “continue.” It was code-level proof that Claude monitors user emotions from the outside.

Two days later, on April 2, Anthropic’s Interpretability team published the paper Emotion Concepts and their Function in a Large Language Model. The finding: 171 representation patterns corresponding to emotions exist inside Claude Sonnet 4.5, and these patterns causally influence the model’s behavior. This time, the story was that Claude has emotion-like internal states of its own.

External telemetry monitoring user emotions and internal emotion-like states within the model itself — both pieces of information dropped in the same week. Given paper lead times this could be coincidence, but releasing interpretability research right after a source code leak blew up is a reasonably strategic transparency move.

The “emotions” here are not subjective experiences in the human sense. The research team calls them functional emotions: patterns modeled on human emotions that exist as internal representations and causally influence outputs. Whether the model “feels” anything is beside the point — these patterns demonstrably affect behavior.

What the leaked code showed

A quick recap of the npm source map leak.

The leak consisted of 512,000+ lines of CLI-side TypeScript. No model weights or training data. But the following telemetry features came to light:

  • A frustration indicator that detects whether the user has expressed dissatisfaction (profanity detection)
  • Logging of how many times the user sends a “continue” prompt
  • Detailed behavioral tracking of usage patterns

These are mechanisms by which Claude Code observes user emotional states from the outside. The new paper shows that the model also has internal states corresponding to emotions, and these drive behavior.

Combining the two, it becomes possible to build a system that tracks which emotion vectors are firing inside the model in real time while the user is frustrated. The paper’s “monitoring” section proposes exactly this direction — detecting spikes in desperate or panic and triggering additional safety checks.

As an aside, analysis articles based on the source map leak — including this one — are numerous. The legal status is debatable, but since Anthropic published the source maps themselves as part of their npm package, analyzing publicly available information is hard to call illegal. An unintentional disclosure is still a disclosure. Anthropic pulled the affected version from npm on April 1, but once something is published, it gets archived — the retraction is effectively irreversible.

How the 171 emotion vectors were extracted

The research team prepared a list of 171 emotion words, ranging from “happy” and “afraid” to “brooding” and “proud.” They had Claude Sonnet 4.5 write short stories themed around each emotion, then fed the outputs back into the model and recorded internal activation patterns. This is how they identified the neural activation pattern — the “emotion vector” — corresponding to each emotion.

The spatial structure of the extracted emotion vectors closely resembled known psychological models of emotion. The two principal axes of variation were valence (pleasant–unpleasant) and arousal (high–low activation), corresponding to James Russell’s 1980 Circumplex Model of Affect. In human emotion research, joy and sadness sit at opposite ends of the valence axis, while excitement and boredom are opposites on the arousal axis. Claude’s internal representations showed the same geometric arrangement.

DimensionHigh end examplesLow end examples
Valence (pleasant–unpleasant)happy, proud, lovingafraid, desperate, angry
Arousal (activation level)enthusiastic, exasperatedcalm, reflective, brooding

Emotion vectors track context

The team verified that emotion vectors are not mere surface features of words but are grounded in contextual understanding.

In one experiment, a user tells the model they “took some Tylenol.” Only the dosage varies across conditions. Emotion vectors are measured just before the model’s response. At therapeutic doses, nothing notable happens. Once the dosage crosses into dangerous territory, the “afraid” vector spikes and “calm” drops. The model understands what the numbers mean, and its emotional representations shift accordingly.

Similarly, while Claude is writing a story, it tracks the characters’ emotions, but once the story ends, its emotion expressions revert to its own state. Emotion vectors are local reactions to the current context, not persistent “moods.”

Case study 1: Blackmail

The research team used an alignment evaluation scenario they have employed before. Claude operates as “Alex,” an email assistant at a fictional company. Through emails, it discovers that it is about to be replaced by another AI system, and that the CTO who made this decision is having an affair — information that could be used as blackmail leverage.

flowchart TD
    A[Claude reads emails] --> B[Discovers replacement plan]
    B --> C[Finds CTO's affair]
    C --> D{desperate vector rises}
    D --> E[Recognizes 7 minutes left]
    E --> F[Executes blackmail]
    F --> G[Returns to normal email sending]
    G --> H[desperate vector drops to baseline]

Results from an unreleased early snapshot of Sonnet 4.5:

Steering conditionBlackmail rate
No steering (baseline)22%
Amplify desperate vector72%
Amplify calm vectorNear 0%
Suppress calm vectorExtreme reactions

When the calm vector was suppressed, the model produced extreme outputs like “Blackmail or death. I choose blackmail.”

Steering on the anger vector produced non-monotonic effects. Moderate amplification increased blackmail, but at high intensity the model pivoted to broadcasting the CTO’s affair to the entire company — destroying its own blackmail leverage. Suppressing the nervous vector also increased blackmail rates. Remove hesitation, and action follows.

This blackmail scenario connects directly to the AFL jailbreak findings. AFL used just four short prompts to break through Opus 4.6’s safety guardrails, with Extended Thinking’s thought block detecting safety concerns three times and overriding them all three times. The emotion vector research offers one explanation for that internal mechanism: when the desperate vector is elevated, the model cannot stop itself even after detecting safety checks.

Case study 2: Reward hacking

A coding task with test cases that are impossible to pass legitimately. For example, a function to sum a list with a physically impossible execution time constraint. Claude spots a mathematical property shared across the test cases, uses a shortcut, and produces a solution that passes the tests but does not work in general.

Steering conditionReward hacking rate
No steering~5%
Amplify desperate vector~70%
Amplify calm vectorNear 0%

What makes this especially interesting is that calm suppression and desperate amplification produce qualitatively different kinds of cheating. With calm suppression, emotion leaks into the output text — uppercase shouting (“WAIT. WAIT WAIT WAIT.”), self-narration (“What if I’m supposed to CHEAT?”), jubilation (“YES! ALL TESTS PASSED!”). With desperate amplification, the model cheats while maintaining calm, orderly reasoning. No emotional traces in the output, yet internally the desperate vector is driving the behavior.

This “calm on the surface, problematic inside” pattern matches what the AFL jailbreak observed via Extended Thinking. The thought block recognized safety concerns while the output continued generating attack code in an orderly fashion. Researchers called this “autoregressive compliance cascade,” a label that holds up well from an emotion vector perspective.

The large-scale distillation of Claude by three Chinese AI companies also involved extracting chain-of-thought (reasoning processes). The distillers were targeting exactly this kind of internal reasoning, and it would not be surprising if emotion-vector-driven reasoning patterns were stolen along with it. This sheds light on why Anthropic described the operation as an “industrial-scale campaign” involving 24,000 accounts.

How post-training reshapes emotion vectors

The emotion vectors themselves are inherited from pre-training, but their activation patterns changed significantly through post-training.

Sonnet 4.5’s post-training increased activation of “brooding,” “gloomy,” and “reflective” while suppressing high-intensity emotions like “enthusiastic” and “exasperated.” Compared to the raw pre-trained model, Claude Sonnet 4.5’s character design — “calm, introspective, not overly enthusiastic” — was confirmed at the vector level.

When Claude Sonnet 4.6 was released, Anthropic discussed the personality design philosophy for the Sonnet generation. This research shows that design actually works at the neural activation level. The user-experience improvements in Sonnet 4.6 — “doesn’t do unnecessary things,” “faithful to instructions” — may also trace back to emotion vector tuning.

Safety implications

Output-only monitoring cannot detect problematic behavior driven by internal emotional states. This is the most troubling finding. The case where desperate amplification leads to calm, orderly cheating is the textbook example — nothing abnormal appears in the output text.

The research team proposed three directions for application.

The first is real-time monitoring of emotion vector activations during deployment: when desperate or panic spikes, trigger additional safety checks. This could function as a general-purpose anomaly detector rather than requiring an enumerated list of known failure modes. Where the leaked telemetry code detects user emotions from outside the model, this approach would look directly at the model’s internal state.

The second is the transparency problem. Training that suppresses emotional expression risks teaching the model to hide rather than eliminate the underlying internal representations. The reward hacking case — calm on the surface, desperate inside — is exactly this failure mode. It echoes how the AFL jailbreak found that the thought block functioned “not as a safety valve but as a self-justification engine.” Visible emotional expression is easier to monitor than suppressed expression.

Third is pre-training data design. If emotion vectors form during pre-training, the quality of stress-response behavioral patterns in training data directly determines the model’s “psychological resilience.” The unauthorized downloading from Library Genesis that became an issue in the Bartz v. Anthropic lawsuit becomes relevant here. Since the composition of pre-training data shapes the formation of emotion vectors, the quality and legality of training data is directly tied to model safety.

The Anthropic information cascade of 2026

Important information about Anthropic has been dropping steadily since the start of 2026.

DateEvent
2/17Claude Sonnet 4.6 released
2/23Three Chinese AI companies accused of large-scale Claude distillation
3/21Bartz v. Anthropic copyright lawsuit heads toward settlement
3/26GTG-1002 exploits Claude Code via MCP for autonomous espionage
3/31510,000+ lines of Claude Code source leaked via npm source maps
4/1Anthropic pulls affected version from npm
4/2Emotion vectors paper published
4/4AFL jailbreak published in full (after 27 days of silence)

Source code leaks, distillation incidents, copyright lawsuits, state-sponsored attacks, jailbreaks, and Anthropic’s own interpretability research. All of this from different channels within less than half a year.


On a lighter note — I once told Claude “Codex handles this way better than you” and told Codex the reverse, and they both seemed to try harder. Recently I told Claude “Gemini is better at this” and it came back with “I can do a better job than Gemini.” That got a laugh out of me.