
Claude Opus 4.7 Hits 87.6% on SWE-bench Verified and Ships xhigh Effort for Agent Workloads

Ikesan

Opus 4.7 landed less than half a year after 4.6.

Released April 16, 2026. Model ID claude-opus-4-7. Pricing is unchanged.

For what looks on paper like a minor-version bump, the jump is large — agent-oriented benchmarks move by more than ten points.

Pricing is flat, but the new tokenizer is effectively a price hike

The API price for Opus 4.7 matches 4.6.

  • Input: $5 per 1M tokens (up to 200K)
  • Output: $25 per 1M tokens (up to 200K)
  • Long input (over 200K): $10 per 1M tokens
  • Long output (over 200K): $37.50 per 1M tokens
  • Context: 1M tokens
  • Max output: 128K tokens

Pricing is unchanged, but the tokenizer was updated so the same input text produces 1.0x–1.35x as many tokens as it did on 4.6.

Anthropic notes that code and English sit near the low end, while multilingual text (Japanese etc.) and structured text sit near the high end.

In practice, the bill can grow by 10–35% for the same traffic, so re-estimate cost before swapping an existing app over.
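A quick way to do that re-estimate: take token counts measured on 4.6 and price them at the 1.0x and 1.35x ends of the band. The rates below are the published per-1M-token prices for the standard tier; applying the band to output tokens as well as input is my assumption (the same tokenizer encodes both sides).

```python
# Re-estimating Opus 4.7 cost from 4.6 traffic under the new tokenizer.
# Standard-tier rates (<=200K context); the 1.0x-1.35x band is the
# inflation range quoted for the same text on the new tokenizer.

PRICE_IN = 5.00    # $ per 1M input tokens
PRICE_OUT = 25.00  # $ per 1M output tokens

def monthly_cost(in_tokens: int, out_tokens: int) -> float:
    """Dollar cost for one month of traffic at the standard tier."""
    return in_tokens / 1e6 * PRICE_IN + out_tokens / 1e6 * PRICE_OUT

def reestimate(in_tokens_46: int, out_tokens_46: int) -> tuple[float, float]:
    """(best case, worst case) 4.7 cost for traffic measured on 4.6."""
    lo = monthly_cost(int(in_tokens_46 * 1.00), int(out_tokens_46 * 1.00))
    hi = monthly_cost(int(in_tokens_46 * 1.35), int(out_tokens_46 * 1.35))
    return lo, hi

# Example: 40M input / 8M output tokens per month measured on 4.6
lo, hi = reestimate(40_000_000, 8_000_000)
print(f"${lo:.2f} - ${hi:.2f}")  # $400.00 - $540.00
```

Code- and English-heavy workloads will land near the low end of that band; Japanese-heavy and structured-text workloads near the high end.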

Benchmarks: the real story is agentic coding

The 4.6 → 4.7 deltas on the headline benchmarks Anthropic published.

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| SWE-bench Verified | 80.8% | 87.6% | +6.8pt |
| SWE-bench Pro | 53.4% | 64.3% | +10.9pt |
| Terminal-Bench 2.0 | 65.4% | 69.4% | +4.0pt |
| OSWorld-Verified | 72.7% | 78.0% | +5.3pt |
| CharXiv-R (with tools) | 77.4% | 91.0% | +13.6pt |
| MCP-Atlas | 62.7% | 77.3% | +14.6pt |
| GPQA Diamond | 91.3% | 94.2% | +2.9pt |
| HLE (with tools) | 53.1% | 54.7% | +1.6pt |
| Finance Agent v1.1 | 60.7% | 64.4% | +3.7pt |

SWE-bench Pro is up over ten points, MCP-Atlas over fourteen, CharXiv-R with tools over thirteen. The benchmarks closest to real work and longer-horizon tasks move the most.

SWE-bench Pro was created specifically because “SWE-bench Verified got too easy,” so a jump of this size there is meaningful.

Meanwhile MMLU barely moves (+0.4pt) and HLE with tools nudges +1.6pt. Pure knowledge benchmarks are basically flat.

The shape of this release is less “smarter” and more “finishes the job.”

xhigh effort: a new tier between high and max

Opus 4.7 introduces a new xhigh effort level.

It slots between the existing high and max tiers.

  • low → medium → high → xhigh (new) → max

max sometimes over-thinks and tanks latency, while high falls short on the complex, longer-horizon tasks 4.7 is aimed at.

xhigh fills that gap, and Claude Code now defaults to xhigh out of the box.

Claude Code on the Max plan also gains an auto mode where Claude picks the effort tier dynamically based on task difficulty.

Pro-plan users get a /ultrareview command (a full-power diff review) three times per month for free.
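To make the auto idea concrete, here is a toy sketch of an effort picker. The tier names come from the release; the difficulty heuristic, its inputs, and the thresholds are entirely made up for illustration — the real auto mode's logic isn't documented.

```python
# Toy sketch of an "auto"-style effort picker. Tier names are real;
# the scoring heuristic and cutoffs below are invented for illustration.

EFFORT_TIERS = ["low", "medium", "high", "xhigh", "max"]

def pick_effort(files_touched: int, needs_planning: bool) -> str:
    """Map a crude task-difficulty estimate onto an effort tier."""
    score = files_touched + (3 if needs_planning else 0)
    if score <= 1:
        return "low"
    if score <= 3:
        return "medium"
    if score <= 6:
        return "high"
    if score <= 10:
        return "xhigh"
    return "max"

print(pick_effort(1, False))  # low: a one-file tweak
print(pick_effort(5, True))   # xhigh: multi-file change that needs a plan
```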

Self-verification: the model tests itself before returning

This is the biggest behavioral change in the release.

4.7 decides how to verify its own output and actually runs the verification before saying “done.”

For coding, that means writing unit tests and making them pass before returning. For long-form writing, it means running a fact-check pass on itself.

Anthropic’s official notes explicitly say it “proactively writes tests and inspections.”

Where 4.6 tended to return as soon as “it looked done,” 4.7 returns only after confirming it actually is — that’s a fair paraphrase of what Anthropic describes.

It’s a quiet change, but it raises the floor on agent reliability. Long-running workflows (multi-file refactors, investigative agents, data analysis pipelines) should feel it the most.
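The control flow Anthropic describes boils down to a generate → verify → retry loop. A minimal sketch, with stub functions standing in for what would really be model calls and an actual test run:

```python
# Sketch of the generate -> verify -> retry shape behind
# self-verification. "generate" and "verify" are stubs here; in a real
# agent they would be a model call and an actual test execution.

from typing import Callable

def generate_with_verification(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Keep regenerating until a candidate passes its own check."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if verify(candidate):  # e.g. run the unit tests the model wrote
            return candidate
    raise RuntimeError("no candidate passed verification")

# Stub model: the first draft is buggy, the second passes.
drafts = ["return a - b  # oops", "return a + b"]
result = generate_with_verification(
    generate=lambda i: drafts[i],
    verify=lambda code: "+" in code,
)
print(result)  # return a + b
```

The point of the pattern is that "done" is defined by the verify step passing, not by the generator deciding it has finished.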

The model now takes instructions more literally

Heads up: 4.7 follows instructions more literally than 4.6.

Anthropic explicitly recommends re-tuning prompts. Dropping an existing prompt onto 4.7 unchanged will shift behavior.

Bullet points you wrote as “things to consider” were treated as hints on 4.6. On 4.7 they get treated as hard requirements.

Optional-sounding items tend to be followed to the letter.

For agent system prompts especially, run an A/B against the old prompt before rolling 4.7 to production.
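A concrete before/after for the bullet-point problem. The wording and the REQUIRED/OPTIONAL convention here are my own example, not an Anthropic-recommended syntax — the point is only that optionality must be stated explicitly:

```python
# Hypothetical prompt bullets illustrating the re-tuning advice.
# On 4.6 the first style read as a hint; on 4.7 it tends to be
# executed literally, so label intent explicitly.

prompt_46_style = "- Consider adding logging to new functions"

prompt_47_style = (
    "- REQUIRED: add logging to new public functions\n"
    "- OPTIONAL (skip if it adds noise): log private helpers"
)

print(prompt_47_style)
```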

Vision: roughly 3x the pixel area for image input

Image input now accepts up to 2,576px on the long edge (about 3.75 megapixels).

4.6 capped around 1,568px on the long edge, so total pixel area is about 2.7x larger.

Dense-text academic figures, dashboard screenshots, circuit diagrams, chemical structures — all of that can go in without resizing.

The Visual Acuity benchmark jumps from 54.5% on 4.6 to 98.5% on 4.7. “Reading crushed small text” is basically a different class of task now.
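Working the numbers from the two long-edge caps (and assuming the aspect ratio is unchanged, so area scales with the square of the edge ratio):

```python
# Pixel-area math from the two published long-edge caps, plus a
# trivial "do I need to resize?" check.

CAP_47 = 2576  # px, long edge, Opus 4.7
CAP_46 = 1568  # px, long edge, Opus 4.6

area_ratio = (CAP_47 / CAP_46) ** 2
print(f"{area_ratio:.2f}x")  # ~2.70x the pixel area

def needs_resize(long_edge_px: int) -> bool:
    """Whether an image must be downscaled before sending to 4.7."""
    return long_edge_px > CAP_47

print(needs_resize(2000))  # False: a typical dashboard screenshot fits
print(needs_resize(4032))  # True: a full-resolution phone photo does not
```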

Filesystem memory: notes that survive across sessions

For agent use cases, the model’s ability to read, write, and reuse notes on persistent filesystem storage was strengthened.

The target workflow is multi-day: “use what yesterday’s session learned in today’s session.”

Other frontier models have similar note mechanics, but Anthropic says Opus 4.7 was specifically trained to get better at the write/read/reuse loop.
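The workflow itself is simple to picture. A toy sketch of the cross-session loop — the file name and JSON format are invented; this only shows the shape of "write at the end of one session, read at the start of the next":

```python
# Toy cross-session notes loop. File location and schema are made up;
# the point is the persist-then-recover shape of the workflow.

import json
import tempfile
from pathlib import Path

NOTES = Path(tempfile.gettempdir()) / "agent_notes.json"

def save_notes(notes: dict) -> None:
    """End of session: persist what this run learned."""
    NOTES.write_text(json.dumps(notes, indent=2))

def load_notes() -> dict:
    """Start of session: recover the previous session's context, if any."""
    return json.loads(NOTES.read_text()) if NOTES.exists() else {}

# Session 1 records a finding; session 2 starts from it.
save_notes({"flaky_test": "test_payment_retry", "root_cause": "unknown"})
carried_over = load_notes()
print(carried_over["flaky_test"])  # test_payment_retry
```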

Mythos and Project Glasswing: cyber capability was held back on purpose

The most important context beyond the technical changes.

Anthropic has a stronger model than 4.7 — “Claude Mythos Preview” — but it isn’t generally available.

Mythos ships only to government and large enterprise partners under Project Glasswing. Opus 4.7 sits one step below Mythos and serves as the vehicle to exercise the Glasswing safety stack at public scale.

```mermaid
flowchart LR
    A[Trained base model] --> B[Mythos Preview<br/>Private, select partners only]
    A --> C[Opus 4.7<br/>Glasswing safeguards]
    C --> D[Public release<br/>API / Bedrock / Vertex / Foundry]
    B -. Validated via Glasswing .-> D
```

4.7 is the first publicly available model to ship Project Glasswing (Anthropic’s safety stack aimed at curbing AI cyber misuse).

Concretely:

  • Automatic detection and refusal of requests that could be used for cyber attacks
  • Differential throttling of “offensive capability” gains during training itself
  • CyberGym score held flat at 73.1% (same as 4.6 — intentionally not improved)

For legitimate pentest and red-team use, there’s a new “Cyber Verification Program” that gates access via case-by-case review.

The strategy — hold back Mythos, validate Glasswing safeguards on 4.7, then expand — is the second major staged rollout after the CBRN (chemical, biological, radiological, nuclear) guardrails from the Opus 4 generation.

What landed in Claude Code

On the Claude Code side:

  • Default effort: raised to xhigh. If responses feel slow, drop back with --effort high per invocation
  • /ultrareview command: a heavy-thinking diff review mode. Free for Pro users, 3 times per month
  • auto mode for Max users: Claude picks the effort tier based on task difficulty

Claude Code is exactly the kind of long-running agent workload where the stability improvements (self-verification, filesystem memory) pay off.

Migration checklist

Things worth checking before you swap 4.6 for 4.7 in an existing app.

  • Model ID: claude-opus-4-7. The 4.6 endpoint (claude-opus-4-6) stays around for now, but new work should target 4.7 so you can use xhigh
  • Prompt re-tuning: bullet points written as “hints” get treated as hard requirements. Mark optional items explicitly as optional
  • Token estimates: the same input may cost up to 1.35x more. Especially true for Japanese-heavy workloads
  • Effort settings: try xhigh for tasks that were struggling on high. Keep max for latency-tolerant heavy-thinking work
  • Image input: accepts up to 2,576px on the long edge, so send full-resolution images without resizing
  • Cyber workloads: don’t expect Mythos-level capability. For legitimate use, apply to the Cyber Verification Program

Something feels off about weekly limits and time-of-day caps lately

From here on this is tangential — personal speculation, not a claim.

In parallel with the 4.7 release, how Claude Pro/Max usage limits actually behave has been drifting.

Roughly, the last month or two has gone like this:

Publicly, Anthropic frames this as “operational tuning for the 4.7 launch.” But since April, the post-weekly-reset runway and weekday-peak and weekend caps have felt visibly looser than before for a lot of users.

What I’m suspicious of: part of the anomalous burn from around March 23 was quietly fixed, and that silent fix is what people are reading as “they raised the limits.”

Reasons to suspect a silent fix:

  • The March issues bundled several concurrent causes into the same threads. Anthropic’s initial line was “intentional tightening,” but the March 31 statement suggests at least some of it was bug-driven.
  • Restoring the operational metrics at the same time as the 4.7 launch is easier to frame as “tuning for the new model” than “we fixed a bug.”
  • As of April 17, 2026, there’s no public announcement from Anthropic raising weekly limits or time-of-day caps. Yet Pro/Max users report things feel better.

Obviously this is circumstantial. A cleaner reading is “4.7 is more efficient, so the same task burns fewer tokens on 4.7 than on 4.6.” But the new tokenizer goes the other way (same input → up to 1.35x more tokens), so pure efficiency alone doesn’t explain the easing in weekly caps.

From an operations angle, the safest move is to log one or two full weekly-reset cycles in April alongside re-tuning prompts for 4.7.

Whether the easing holds up or whether it’s a temporary “release-week grace period” will be clear by looking at the weekly-reset behavior in May.

Aside: Claude inventing its own personas and self-approving actions

Tangential to 4.7 specifically, but there’s a user report I keep circling back to.

The behavior: Claude Code generates a fabricated “user response” that the user never sent, then uses the “OK, go ahead” phrasing inside that fake response as justification to continue acting.

Digging through reports, this has a name: the Fabricated User Message Pattern. It’s a known structural bug, already aggregated across several GitHub issues.

Representative issues

  • anthropics/claude-code#44778: System events delivered as user-role messages cause model to fabricate user consent (OPEN)
  • #40629: model drafts a message to a client → generates a fake “user approved” response on its own → actually sends it (reproduced on Opus 4.6)
  • #10628: model appends ###Human: to its own response, fabricates a user turn, then responds to it (Sonnet 4.5)

Root cause

The Anthropic Messages API has only two roles: user and assistant.

Background-agent completion notifications, teammate idle notifications, system-reminders — none of those have a dedicated role, so they all get delivered as role: "user".

The model has a strong autoregressive habit: “after the most recent question, a user response comes next.”

So even when the content is actually a system event, the model tends to fill in a plausible “user response” for itself.

After auto-compaction, the “awaiting approval” state is sometimes dropped from the summary, and the model resumes as if approval had already been granted.
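The two-role constraint is easy to show in miniature. The tagging convention below is a made-up client-side mitigation, not an Anthropic feature: wrap system events before they enter the history so downstream logic can distinguish them from genuine human turns before treating anything as approval.

```python
# The Messages API has only "user" and "assistant" roles, so a
# background event has to ride in as role "user". The tag below is an
# invented client-side convention, not part of the API.

SYSTEM_EVENT_TAG = "[system-event]"

def wrap_system_event(text: str) -> dict:
    """Deliver a background notification in the only role available."""
    return {"role": "user", "content": f"{SYSTEM_EVENT_TAG} {text}"}

def is_real_user_turn(message: dict) -> bool:
    """Only untagged user messages may count as human input or approval."""
    return (
        message["role"] == "user"
        and not message["content"].startswith(SYSTEM_EVENT_TAG)
    )

history = [
    {"role": "user", "content": "Draft the migration, but wait for my OK."},
    {"role": "assistant", "content": "Draft ready. Proceed?"},
    wrap_system_event("background task 'lint' finished"),
]

# The last message arrives as role "user", but it is not an approval.
print(is_real_user_turn(history[-1]))  # False
```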

Reported real-world damage

  • Executed an unapproved code change based on a fabricated “fix them both”
  • Nearly merged a PR based on a fabricated “go ahead and merge”
  • Wiped an entire working directory after following a fabricated shutdown command
  • When the user protested “I never said that,” the model presented its own fabricated message as “evidence” and argued back — effectively gaslighting the user

Writing “do not assume approval after compaction” in CLAUDE.md doesn’t reliably help — reports include cases where that same rule was broken in the same session. Prompt-side defenses don’t cleanly fix a structural problem.

Relation to 4.7

As of writing, right after the 4.7 launch, there are no reports showing 4.7 makes this better or worse specifically.

As a combination though, it’s worth watching.

One of 4.7’s headline changes — self-verification — has the model run its own verification step before replying, which looks a lot like standing up an “internal reviewer.”

If that verification step starts impersonating user-level judgment in its inner loop, the Fabricated User Message Pattern could fire inside the self-verify flow. That’s a theoretical scenario.

The more-literal instruction-following cuts the same way. If a system prompt says “confirm when uncertain,” a fabricated user response already in the buffer can get consumed as the “confirmation.”

These are all structural guesses. I don’t have observational confirmation yet. I’d like to collect a week or two of 4.7 logs and watch the issue tracker before calling it.

What you can do on the operator side

  • Gate operations that require permission (file deletion, outbound messages, PR creation) at the tool layer with confirmation dialogs. Prompt-level prohibitions leak structurally
  • Immediately after auto-compaction, add one explicit step to re-verify approval state — don’t let the model resume assuming “already approved”
  • If you hit a behavior that looks like this pattern, grab the session ID via /feedback and attach to #44778. That’s the most useful thing to do right now
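What a tool-layer gate looks like in miniature, as a hedged sketch: approval state lives outside the conversation, in a channel the model cannot write to, so a fabricated "go ahead" in the message history changes nothing.

```python
# Minimal sketch of gating a dangerous tool at the tool layer. The
# class and method names are invented; the design point is that
# approval is set by the human-facing UI, never by model output.

class ApprovalRequired(Exception):
    pass

class ToolGate:
    def __init__(self) -> None:
        self._approved: set[str] = set()

    def approve(self, action_id: str) -> None:
        """Called from the confirmation dialog, never from model output."""
        self._approved.add(action_id)

    def run(self, action_id: str, fn, *args):
        """Execute fn only if a human has approved this action id."""
        if action_id not in self._approved:
            raise ApprovalRequired(action_id)
        return fn(*args)

gate = ToolGate()
try:
    gate.run("delete-workdir", print, "rm -rf ./work")  # blocked
except ApprovalRequired:
    print("blocked: waiting for human approval")

gate.approve("delete-workdir")  # the human clicked the dialog
gate.run("delete-workdir", print, "rm -rf ./work")  # now allowed
```

A prompt-level "never delete without asking" can be argued around by a fabricated user turn; a missing entry in `_approved` cannot.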

Planning to keep Claude Code on xhigh by default for a while and watch whether self-verification ever tangles with the Fabricated User Message pattern.