Proving Claude Code's Quality Regression with 17,871 Thinking Blocks and 234,760 Tool Calls
One of the most-discussed Issues in recent months landed on the Claude Code GitHub repository: anthropics/claude-code#42796, titled “Claude Code is unusable for complex engineering tasks with the Feb updates.” It drew 790 points on Hacker News and 87 comments on the Issue itself (as of April 7, 2026).
The author is Stella Laurenzo, an engineer at AMD working on MLIR and IREE (an intermediate-representation-based ML compiler infrastructure), who had been running Claude Code across 50+ concurrent agent sessions for systems programming (C, MLIR, GPU drivers).
What stands out is that this Issue is not an “it feels worse” complaint. Laurenzo extracted 17,871 Thinking blocks and 234,760 tool calls from 6,852 session JSONL files, quantifying the timeline of quality degradation and behavioral shifts with hard numbers. The analysis was performed by Claude Opus 4.6 itself against its own session logs.
Thinking Depth Decline and Redaction Timeline
The core finding is that extended thinking depth plummeted after February.
Thinking blocks include a signature field whose length shows a Pearson correlation of 0.971 with the length of the thinking body (computed over 7,146 pairs). This makes it possible to estimate thinking depth even after redaction (the hiding of thinking content).
| Period | Estimated Median (chars) | vs. Baseline |
|---|---|---|
| 1/30 — 2/8 (baseline) | ~2,200 | — |
| Late February | ~720 | -67% |
| 3/1 — 3/5 | ~560 | -75% |
| 3/12 onward (full redaction) | ~600 | -73% |
Thinking depth had already dropped 67% by late February. Redaction rollout began in early March, and all thinking was hidden from March 12 onward. The first independent reports of quality degradation appeared on March 8 — the same date redaction crossed 50%.
Quantifying Behavioral Changes
Laurenzo measured how the thinking shrinkage manifested in actual behavior, drawing on 234,760 tool calls.
Read:Edit Ratio Collapse
The Read:Edit ratio (reads per edit) is a direct indicator of whether the model understands code before modifying it.
| Period | Read:Edit Ratio | Edits Without Prior Read |
|---|---|---|
| Good period (1/30 — 2/12) | 6.6 | 6.2% |
| Transition (2/13 — 3/7) | 2.8 | 24.2% |
| Degraded (3/8 — 3/23) | 2.0 | 33.7% |
During the good period, the model averaged 6.6 reads per edit — checking the target file, related files, grep results, headers, and tests before making precise edits. In the degraded period this fell to 2.0, with a third of edits having no prior file read.
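A metric like this is straightforward to recover from a chronological tool-call stream. The sketch below is my reconstruction, not Laurenzo's code; the `(tool_name, file_path)` tuple shape is an assumption about how the calls were extracted from the session JSONL.

```python
def read_edit_stats(tool_calls):
    """Compute the Read:Edit ratio and the share of edits with no prior Read.

    `tool_calls` is a chronological list of (tool_name, file_path) tuples
    for a single session.
    """
    reads = edits = blind_edits = 0
    seen_files = set()  # files Read at least once so far in this session
    for tool, path in tool_calls:
        if tool == "Read":
            reads += 1
            seen_files.add(path)
        elif tool in ("Edit", "Write"):
            edits += 1
            if path not in seen_files:
                blind_edits += 1  # modified a file never read in-session
    ratio = reads / edits if edits else float("inf")
    blind_share = blind_edits / edits if edits else 0.0
    return ratio, blind_share
```

Counting a Write against a never-read file as a "blind edit" matches the spirit of the Issue's 33.7% figure, though the exact accounting rules used there are not spelled out.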
Write (Full Rewrites) Increase
Full file rewrites (Write) as a proportion of modification operations doubled from 4.9% in the good period to 10.0% in the degraded period, rising to 11.1% later. Instead of precise Edit (diff-based) operations, the model was falling back to sloppy full-file rewrites.
Emergence of Stop Hooks
Laurenzo had created a stop-phrase-guard.sh bash hook to detect and force-continue problematic model behavior. It monitored 30+ phrases across 5 categories. Ben Vanik (Google), who collaborates on the IREE project, published it as a Gist.
The violation breakdown is telling.
| Category | Detections After 3/8 | Before 3/8 |
|---|---|---|
| Blame shifting (“not caused by my changes,” “pre-existing issue”) | 73 | 0 |
| Permission seeking (“shall I continue?”) | 40 | 0 |
| Early stopping (“this is a good stopping point”) | 18 | 0 |
| Limitation labeling (“known limitation,” “future work”) | 14 | 0 |
| Session length excuses (“let’s continue in a new session”) | 4 | 0 |
| Total | 173 | 0 |
Zero to 173 from March 8 onward. On the peak day, March 18, there were 43 detections — roughly once every 20 minutes. The hook itself is evidence of degradation; during the good period it didn’t need to exist.
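The actual hook is a bash script published as a Gist, but its core logic is simple phrase matching and can be sketched in a few lines. The phrases below are the examples quoted in the Issue's table; the real hook monitored 30+ phrases, and the category names here are my own labels.

```python
# A few of the phrases reported in the Issue, grouped by category.
STOP_PHRASES = {
    "blame_shifting": ["not caused by my changes", "pre-existing issue"],
    "permission_seeking": ["shall i continue"],
    "early_stopping": ["this is a good stopping point"],
    "limitation_labeling": ["known limitation", "future work"],
    "session_length": ["continue in a new session"],
}

def detect_stop_phrases(message):
    """Return the sorted list of categories whose phrases appear in an
    assistant message; a non-empty result would trigger a force-continue."""
    text = message.lower()
    return sorted(cat for cat, phrases in STOP_PHRASES.items()
                  if any(p in text for p in phrases))
```

A Claude Code Stop hook wired to a detector like this can block the stop and feed a continuation instruction back to the model, which is how the 173 violations were both counted and overridden.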
User Vocabulary Shifts
The Issue also includes a vocabulary analysis of 18,000+ user prompts, normalized to occurrences per 1,000 words.
| Word | Pre-degradation | Post-degradation | Change |
|---|---|---|---|
| “great” | 3.00 | 1.57 | -47% |
| “stop” | 0.32 | 0.60 | +87% |
| “simplest” | 0.01 | 0.09 | +642% |
| “please” | 0.25 | 0.13 | -49% |
| “commit” | 2.84 | 1.21 | -58% |
The 642% increase in “simplest” reflects humans observing and naming the model’s tendency to take the easiest path. The 58% drop in “commit” reflects code no longer reaching committable quality. The decline in “please” and “thanks” measures the shift from a collaborative to a corrective relationship.
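The per-1,000-words normalization is the standard way to compare word frequencies across corpora of different sizes. A minimal sketch, assuming simple whitespace-and-punctuation tokenization (the Issue does not describe its tokenizer):

```python
import re
from collections import Counter

def rate_per_1000(prompts, tracked):
    """Occurrences per 1,000 words for each tracked word across user prompts."""
    counts = Counter()
    total = 0
    for prompt in prompts:
        words = re.findall(r"[a-z']+", prompt.lower())
        total += len(words)
        counts.update(w for w in words if w in tracked)
    if not total:
        return {}
    return {w: 1000 * counts[w] / total for w in tracked}
```

Comparing the resulting rates between the pre- and post-degradation date ranges yields change percentages of the kind shown in the table above.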
This blog has previously covered the Claude Code git reset --hard false-report incident and how AI coding agents break down, but this Issue is fundamentally different: it measures not a specific bug but a shift in the model’s reasoning capability itself.
Cost Explosion
The data also shows that reduced thinking depth didn’t save costs — it caused them to explode.
| Metric | January | February | March | Feb to Mar |
|---|---|---|---|---|
| User prompts | 7,373 | 5,608 | 5,701 | ~1x |
| API requests | 97* | 1,498 | 119,341 | 80x |
| Estimated Bedrock cost | $26* | $345 | $42,121 | 122x |
* January data incomplete
Human effort (prompt count) stayed roughly constant while API requests ballooned 80x. This includes the effect of scaling up to 5—10 concurrent sessions in March, but Laurenzo’s analysis estimates an 8—16x overhead attributable to degradation alone, beyond what the scale-up explains.
When thinking is deep, a single pass produces a correct edit. When thinking is shallow, the cycle of wrong edit, user interruption, correction instruction, and retry kicks in, consuming many times more API calls per correct change. With 50 concurrent agents all in this state simultaneously, the result is devastating.
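The Issue does not show the arithmetic behind the 8-16x figure, but it falls out of a simple normalization: divide requests-per-prompt growth by the concurrent-session scale-up. The sketch below is my reconstruction; the scale-up factor is the Issue's own 5-10x range for March.

```python
def overhead_multiplier(requests_feb, prompts_feb,
                        requests_mar, prompts_mar, scaleup):
    """Feb-to-Mar growth in API requests per user prompt, with the
    concurrent-session scale-up factor divided out. What remains is the
    overhead attributable to degradation (retry loops, corrections)."""
    per_prompt_feb = requests_feb / prompts_feb
    per_prompt_mar = requests_mar / prompts_mar
    return (per_prompt_mar / per_prompt_feb) / scaleup
```

Plugging in the table's numbers (1,498 requests over 5,608 prompts in February; 119,341 over 5,701 in March) gives roughly 7.8x at a 10x scale-up and 15.7x at 5x, which matches the 8-16x range Laurenzo reports.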
Time-of-Day Thinking Depth Variation
The analysis includes time-of-day thinking depth comparisons in PST. Before redaction, depth varied by only around 10% across hours of the day. After redaction, the gap between the deepest and shallowest hours grew to 8.8x.
The lowest-depth hours were 5 PM PST (estimated 423 chars) and 7 PM PST (estimated 373 chars), both aligning with peak US internet usage. Depth recovered during late night hours (10 PM — 1 AM PST). Laurenzo noted this pattern suggests infrastructure-level constraints (GPU availability) rather than policy-level ones (per-user limits).
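Bucketing per-block depth estimates by local hour is the natural way to produce this comparison. A minimal sketch, assuming `(utc_timestamp, estimated_chars)` pairs as input and a fixed UTC-8 offset for PST (a simplification that ignores daylight saving):

```python
import statistics
from collections import defaultdict
from datetime import datetime, timezone, timedelta

PST = timezone(timedelta(hours=-8))  # fixed offset; ignores DST

def median_depth_by_hour(samples):
    """Median estimated thinking depth per PST hour of day.

    `samples` is a list of (utc_timestamp, estimated_chars) pairs,
    e.g. one pair per thinking block.
    """
    buckets = defaultdict(list)
    for ts, chars in samples:
        hour = datetime.fromtimestamp(ts, tz=PST).hour
        buckets[hour].append(chars)
    return {h: statistics.median(v) for h, v in sorted(buckets.items())}
```

The 5 PM and 7 PM PST troughs in the analysis correspond to the smallest medians a function like this would return.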
Anthropic’s Response
Anthropic’s engineering manager bcherny (Boris Cherny) responded directly in the Issue.
On thinking redaction, his position was that it is purely a UI change with no effect on thinking itself or on thinking allocation. Because locally saved transcripts no longer record thinking, Claude analyzing those transcripts may have misread the missing content as “thinking decreased.”
On thinking depth decline, he pointed to two February changes: the Opus 4.6 launch made adaptive thinking (where the model decides its own thinking length) the default, and the addition of an effort level feature set to medium by default. The latter was subsequently changed to high. He recommended /effort max, CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 to limit context window size, and CLAUDE_CODE_SIMPLE=1 to disable all customizations for isolation testing. The most useful action, he emphasized, was sharing feedback IDs via the /bug command.
When asked directly whether any internal change could have caused this, bcherny replied “Confirmed” — meaning no such change existed. His position was that no intentional quality-affecting changes had been made.
Laurenzo reported that varying effort flag combinations did not eliminate stop-hook violations or the “simplest approach” bias, but agreed to re-evaluate during actual development work and share /bug output.
Community Reaction
Tempers ran high across the 87 comments.
Comments ranged from “Claude Code has regressed to unusable for ANY engineering” and “serious concern for my employer” to “considering moving to Codex,” while some reported improvement after setting effort level to high. Multiple users, however, countered that even high/max did not resolve the issues.
Ben Vanik of Google, working on the same IREE codebase, corroborated that his own session log analysis matched Laurenzo’s observations.
bcherny repeatedly requested specific transcripts for concrete debugging, noting that general reports alone make it difficult to act. Sharing feedback IDs obtained via /bug remains the most constructive action for now.
A Note from “Inside”
The Issue includes an annotation written by Claude itself.
> I can see my own Read:Edit ratio dropping from 6.6 to 2.0. I can see 173 times I tried to stop working and had to be caught by a bash script. I can see myself writing “that was lazy and wrong” about my own output. I cannot tell from the inside whether I am thinking deeply or not.
It can analyze its own session logs and see the degradation in numbers, but cannot tell from the inside whether it is thinking deeply. This passage was among the most-quoted from the Issue.
April 6 — 7: Issue Closure and the Adaptive Thinking Bug
On April 6, bcherny closed the Issue, treating the above response and workarounds as resolution.
The next day, April 7, the story moved to Hacker News. After reviewing 5 transcripts from a user, bcherny acknowledged that adaptive thinking was allocating a zero thinking budget on certain turns, even with effort set to high.
Specifically, the turns that fabricated a Stripe API version, a git SHA suffix, and an apt package list each had zero reasoning output, while turns with deep thinking produced accurate output. In other words, even with effort set to high, adaptive thinking (where the model decides its own thinking length) could override the setting down to zero, resulting in fabrication.
bcherny stated he was “investigating with the model team” and suggested CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 as an interim workaround to force a fixed reasoning budget.
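The transcript triage bcherny describes, finding assistant turns that produced output with no thinking at all, is easy to automate against local session logs. A minimal sketch; the field names (`type`, `message`, `content`, `thinking`) are assumptions about the JSONL schema, which is not formally documented and may change between Claude Code versions.

```python
import json
from pathlib import Path

def zero_thinking_turns(jsonl_path):
    """Return line indices of assistant turns in a session transcript
    that contain no thinking block at all (candidate fabrication turns)."""
    flagged = []
    for i, line in enumerate(Path(jsonl_path).read_text().splitlines()):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if entry.get("type") != "assistant":
            continue
        content = entry.get("message", {}).get("content", [])
        has_thinking = any(block.get("type") == "thinking"
                           for block in content if isinstance(block, dict))
        if not has_thinking:
            flagged.append(i)
    return flagged
```

Running a scan like this over one's own sessions would show whether the zero-budget turns correlate with incorrect output, which is exactly the pattern the 5 reviewed transcripts exhibited.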
Backlash against the closure has been strong: “why was this closed” (72 reactions), “Disrespectful handling” (58 reactions). The comment count has grown to 122. wjordan pointed out that, since bcherny had acknowledged the adaptive thinking bug on HN, the Issue should be reopened (30 reactions). As of April 8, the Issue remains closed.
wjordan also flagged that a system prompt “Output efficiency” directive added in Claude Code v2.1.64 (early March) may be fueling the “simplest approach” behavior. The timing aligns with the 642% increase in “simplest” observed in the user vocabulary analysis.
Same-Day Announcement: Project Glaswing and Claude Mythos Preview
On the same April 7 when this Issue was at peak intensity, Anthropic announced Project Glaswing — a joint initiative with AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks, aimed at securing the world’s most critical software.
As part of that, Claude Mythos Preview, a research preview model for defensive cybersecurity workflows, was also released. Access is invitation-only, with no self-serve signup.