Anthropic's Unannounced Cache TTL Change and Claude Code Quota Exhaustion
Claude Code offers a $100/month Max 5x plan and a $200/month Max 20x plan. They promise “5x Pro” and “20x Pro” usage, respectively. So what is Pro’s baseline? How many tokens does “5x” actually buy? Anthropic has never disclosed that number.
Users have limited means to check their Claude Code usage:

- a usage graph fetched from an internal API when launching Claude Code in the terminal
- the /usage command
- the claude.ai dashboard

All of these show only “how much of this window you’ve used” as percentages or bars — never the absolute number of tokens per 5-hour window.
Users are paying $100/month for a product whose contents are not quantified.
On top of this opacity, two problems converged between March and April 2026: a research report showing that prompt cache TTL was silently changed (GitHub Issue #46829), and a report that a Pro Max 5x quota was exhausted in 1.5 hours (GitHub Issue #45756). The latter collected 635 points and 565 comments on Hacker News.
Because the quota baseline is unknown, users can’t tell whether something feels off because of a spec change or their own usage patterns. They had no choice but to start by analyzing their own session logs and reverse-engineering the quota formula.
How Prompt Caching Works
Claude API’s Prompt Caching reduces the cost of resending system prompts and long contexts with every API call. Previously sent prompt prefixes are cached on Anthropic’s servers, and when the next request references the same content, it’s processed as a cache read.
There are two TTL tiers.
| TTL | Write Cost (Sonnet) | Read Cost | Notes |
|---|---|---|---|
| 5 min | $3.75/MTok (1.25x base) | $0.30/MTok (0.1x base) | Default since March 2026. Auto-refreshes if reused within 5 minutes |
| 1 hour | $6.00/MTok (2x base) | $0.30/MTok (same) | Must be explicitly specified. Useful when conversation gaps are 5-60 minutes |
Reads are 12.5x cheaper than writes. With the 5-minute TTL, a single coffee break invalidates the cache, and the next request must re-cache the entire context (write cost).
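As a back-of-the-envelope sketch of what that asymmetry costs, here is the per-request price of re-sending a context prefix under the Sonnet rates from the table above (the 200k-token context size is a hypothetical example, not a measured figure):

```python
# Sonnet cache prices from the table above (USD per million tokens).
WRITE_5M = 3.75   # 5-minute TTL cache write
READ = 0.30       # cache read (same for both tiers)

def turn_cost(context_tokens: int, cache_hit: bool) -> float:
    """Cost of sending the context prefix for one request."""
    rate = READ if cache_hit else WRITE_5M
    return context_tokens / 1_000_000 * rate

# Hypothetical 200k-token session context.
ctx = 200_000
print(f"cache hit:  ${turn_cost(ctx, True):.2f}")   # $0.06
print(f"cache miss: ${turn_cost(ctx, False):.2f}")  # $0.75
```

At 12.5x the read rate, every expired cache turns a $0.06 request into a $0.75 one for this hypothetical session.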
How the TTL Change Was Discovered
Issue #46829’s author (seanGSISG) analyzed session logs from 119,866 API calls collected across two machines (a Linux workstation and a Windows laptop, on separate accounts).
Claude Code records the usage object from API responses as JSONL files under `~/.claude/projects/`, which include breakdowns such as `cache_creation.ephemeral_5m_input_tokens` and `ephemeral_1h_input_tokens`.
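The issue’s tallies come from exactly this kind of log walk. A minimal sketch, assuming the field nesting matches the names cited above (the actual JSONL schema isn’t publicly documented and may differ across Claude Code versions):

```python
import json
from collections import Counter
from pathlib import Path

def tally_cache_writes(projects_dir: str = "~/.claude/projects") -> Counter:
    """Sum 5-minute vs 1-hour cache-write tokens across all session logs."""
    totals = Counter()
    root = Path(projects_dir).expanduser()
    if not root.exists():
        return totals
    for log in root.rglob("*.jsonl"):
        for line in log.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            if not isinstance(entry, dict):
                continue
            # Field names as cited in Issue #46829; the exact nesting is an
            # assumption and may vary by Claude Code version.
            usage = entry.get("message", {}).get("usage", {})
            cache = usage.get("cache_creation", {})
            totals["5m"] += cache.get("ephemeral_5m_input_tokens", 0)
            totals["1h"] += cache.get("ephemeral_1h_input_tokens", 0)
    return totals
```

Run over a date range straddling March 6, a tally like this would surface the same 5-min/1-hour split the issue reports.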
| Phase | Period | TTL Behavior |
|---|---|---|
| Phase 1 | January 2026 | 5-min only (1-hour tier didn’t exist yet) |
| Phase 2 | Feb 1 - Mar 5 | 1-hour only (consistent for 33 days) |
| Phase 3 | Mar 6-7 | Mixed (5-min reappeared) |
| Phase 4 | Mar 8 - April | Primarily 5-min (1-hour minimal) |
```
2026-02-15 | 5m:  0.00M | 1h: 13.61M  ← heaviest day, 100% 1-hour
2026-02-28 | 5m:  0.00M | 1h: 16.15M  ← same
2026-03-05 | 5m:  0.00M | 1h:  6.55M  ← last clean 1-hour-only day
2026-03-06 | 5m:  0.29M | 1h:  0.22M  ← 5-min reappears
2026-03-08 | 5m: 16.86M | 1h:  3.44M  ← 5-min at 83%
2026-03-21 | 5m: 21.37M | 1h:  1.70M  ← 5-min at 93%
```
No client-side changes were made — same Claude Code version, same usage patterns. The TTL tier is determined server-side by Anthropic.
And this change was never announced. Not in the API docs, not in release notes, not in any changelog. It took self-analysis of 119,866 log entries to uncover it.
Anthropic’s Cost Argument
Anthropic engineer notitatall responded in the Issue.
The March 6th change reduces costs. Using 1-hour TTL for all requests can increase costs. 1-hour writes are about 2x the price of 5-minute writes. Without guaranteed cache reuse, the higher write cost isn’t offset by reads.
When the author re-analyzed their data by session type, a structural difference emerged.
| Session Type | API Calls | 5-min Tokens | 1-hour Tokens |
|---|---|---|---|
| Main session | 46,376 | 35.3M | 165.1M |
| Sub-agents | 73,490 | 239.8M | 134.5M |
Sub-agents, making up 61% of calls, operate at a median interval of 1.4 seconds — well within the 5-minute TTL. For sub-agents, 5-minute writes ($3.75/MTok) are cheaper than 1-hour writes ($6.00/MTok). The revised estimate showed API pay-as-you-go users saved approximately $418 from the March 6 change.
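The trade-off can be sketched numerically with the Sonnet prices from the table above. The reuse counts are illustrative, and the model simplifies TTL behavior to one write followed by evenly spaced reuses:

```python
# Sonnet prices (USD/MTok) from the pricing table above.
WRITE_5M, WRITE_1H, READ = 3.75, 6.00, 0.30

def prefix_cost(tokens: int, gap_min: float, reuses: int, ttl_1h: bool) -> float:
    """Cost of caching a prefix once, then reusing it `reuses` times,
    `gap_min` minutes apart.  Simplification: if the gap exceeds the TTL,
    every reuse misses the cache and pays the write price again."""
    mtok = tokens / 1_000_000
    ttl, write = (60, WRITE_1H) if ttl_1h else (5, WRITE_5M)
    if gap_min <= ttl:
        return mtok * (write + reuses * READ)   # one write, cheap reads
    return mtok * write * (1 + reuses)          # a fresh write every time

# Sub-agent pattern: 1.4-second gaps -> the cheaper 5-min write wins.
print(prefix_cost(1_000_000, 1.4 / 60, 10, ttl_1h=False))  # 6.75
print(prefix_cost(1_000_000, 1.4 / 60, 10, ttl_1h=True))   # 9.0

# Interactive pattern: 30-minute gaps -> only the 1-hour tier stays cached.
print(prefix_cost(1_000_000, 30, 10, ttl_1h=False))  # 41.25
print(prefix_cost(1_000_000, 30, 10, ttl_1h=True))   # 9.0
```

The same arithmetic cuts both ways: dense sub-agent traffic favors the 5-minute tier, while human-paced sessions with 5-60 minute gaps are exactly where the 1-hour tier pays for itself.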
But this argument only applies to API pay-as-you-go users. For subscription users, the issue isn’t per-token cost but quota consumption rate, and how TTL changes affect quota requires separate investigation.
Pro Max 5x Quota Exhaustion Reports
Claude Max plans come in two tiers.
| Plan | Monthly Price | Usage Multiplier (vs Pro) |
|---|---|---|
| Max 5x | $100 | 5x |
| Max 20x | $200 | 20x |
Quota resets on a 5-hour window basis. Issue #45756’s author, molu0219, arrived with detailed data: within a 1.5-hour stretch after a quota reset, 691 API calls for “Q&A and light development work” exhausted the Max 5x allowance.
| Session | API Calls | Cache Reads | Output Tokens |
|---|---|---|---|
| vibehq (main) | 222 | 23.2M tokens | 91k |
| token-analysis (background) | 296 | 57.6M tokens | 145k |
| career-ops (background) | 173 | 23.1M tokens | 148k |
| Total | 691 | 103.9M tokens | 387k |
The user was only actively working in vibehq. The other two sessions were left open in separate terminals, independently making API calls and consuming from the shared quota.
Community user cnighswonger tested how `cache_read` tokens affect quota across 6 windows, with a surprising result.
| Hypothesis | Estimated Quota per Window | Coefficient of Variation (CV) |
|---|---|---|
| cache_read = 0x (not counted) | ~4.66M token-equivalent | 34.4% |
| cache_read = 0.1x (same as billing rate) | ~24.4M token-equivalent | 101.6% |
| cache_read = 1.0x (full rate) | ~201.6M token-equivalent | 123.7% |
The lowest CV (best fit to data) was the “cache_read has almost no impact on quota” model. This means quota consumption is driven by something other than cache reads.
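cnighswonger’s method can be sketched as a grid search over candidate `cache_read` weights, keeping the one whose per-window totals vary least. The window data below is invented for illustration; only the method mirrors the issue:

```python
import statistics

def token_equiv(win: dict, read_weight: float) -> float:
    """Weighted token total for one 5-hour window under a candidate model."""
    return (win["input"] + win["output"] + win["cache_creation"]
            + read_weight * win["cache_read"])

def best_read_weight(windows: list, candidates=(0.0, 0.1, 1.0)) -> float:
    """Return the cache_read weight with the lowest coefficient of variation.

    If every window ends in quota exhaustion, the correct weighting should
    make the windows' token-equivalent totals nearly identical (low CV)."""
    def cv(w: float) -> float:
        totals = [token_equiv(win, w) for win in windows]
        return statistics.stdev(totals) / statistics.mean(totals)
    return min(candidates, key=cv)

# Hypothetical exhaustion windows (illustrative numbers, not the issue's data):
# non-read traffic is roughly constant while cache reads swing wildly.
windows = [
    {"input": 1.0e6, "output": 0.4e6, "cache_creation": 3.2e6, "cache_read": 20e6},
    {"input": 1.1e6, "output": 0.3e6, "cache_creation": 3.3e6, "cache_read": 60e6},
    {"input": 0.9e6, "output": 0.5e6, "cache_creation": 3.1e6, "cache_read": 105e6},
]
print(best_read_weight(windows))  # 0.0 -> reads barely count toward quota
```

With this invented data, only the zero-weight model makes the exhaustion windows look alike, which is the shape of the argument behind the table above.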
But this investigation itself was only necessary because the quota calculation logic isn’t public.
Nobody Knows What “5x” Actually Means
Looking back at these investigations, the absurdity of what users are doing becomes clear. They’re reverse-engineering the contents of a $100/month service from session logs.
Max 5x is “5x Pro,” Max 20x is “20x Pro.” But Pro’s baseline token count has never been disclosed. If you don’t know the denominator, “5x” is a meaningless number.
Here’s what users can actually see when checking Claude Code usage.
| Method | Display |
|---|---|
| Terminal launch usage graph | Fetches window utilization from Anthropic’s internal API. Bar graph, percentage only |
| /usage command | In-session usage status. Percentage only |
| claude.ai dashboard | Browser-based usage check. Bars and percentages |
Every method shows only percentages. How many tokens the denominator represents, how different token types (input, output, cache_read, cache_creation) are weighted — none of this is disclosed.
These display methods aren’t even based on public APIs or documentation.
The terminal launch graph comes from an API that Claude Code calls internally, and that endpoint’s specification isn’t public.
The only way for users to independently track their usage is to parse the session logs (JSONL files under ~/.claude/projects/) that Claude Code writes locally.
cnighswonger’s estimate of “approximately 4.66M token-equivalent per window” for Max 5x was derived exactly this way, reverse-engineering 6 windows of data.
Anthropic has neither confirmed nor denied this estimate.
Here’s the situation users face.
- They don’t know what “5x” means in concrete processing capacity
- Remaining quota is visible only as a percentage with an unknown denominator
- There’s no specification for how different token types are weighted in quota consumption
- When TTL or cache calculation logic changes server-side, they have no way to independently assess the impact on quota
- When quota drains at different rates on different days for the same work, the only way to investigate is session log analysis
It’s remarkably rare for a SaaS product to sell subscriptions at $100 or $200/month without quantifying what’s included. In cloud computing and API services, rate limits and included resource quantities are standard on pricing pages. Anthropic’s own API pricing page lists per-token costs for each model, yet the subscription plans get away with relative terms like “5x” and “20x.” Even OpenAI at least states “unlimited (fair use)” for ChatGPT Pro’s rate limits — a nominal baseline, however vague.
The problem with this black-box structure is that server-side changes are invisible to users. Whether TTL changes, cache calculation logic changes, or the quota denominator itself changes, users only see “seems like it’s draining faster today.” seanGSISG analyzing 119,866 log entries to prove the TTL change, cnighswonger reverse-engineering the quota formula — these should have been a matter of reading a spec sheet.
Context Bloat and Quota Spikes
Anthropic engineer Boris Cherny (@bcherny) commented on Issue #45756. He cited two root causes.
First, the combination of the 1M-token context window and TTL. Claude Code’s main agent uses 1-hour cache, but pausing a session for over an hour and resuming causes complete cache invalidation. This also occurs during auto-compact when sessions balloon to near 1M context.
Second, multiple background agents consuming from the shared quota.
```mermaid
graph TD
    A[Session start] --> B[Tool calls / file reads]
    B --> C[Context accumulation<br/>32k → 966k tokens]
    C --> D{Idle > 5 min?}
    D -->|Yes| E[Cache expires<br/>5-min TTL]
    D -->|No| F[Cache read<br/>low cost]
    E --> G[cache_creation triggered<br/>full context re-written]
    G --> H[Major quota consumption]
    C --> I{Context near limit?}
    I -->|Yes| J[auto-compact triggered<br/>entire context sent in one API call]
    J --> K[Maximum quota spike]
    K --> A
    H --> A
```
The author’s measurements showed context growing through these cycles.
- Segment 1: 32k → 783k (835 calls) → auto-compact
- Segment 2: 39k → 966k (1,842 calls) → auto-compact
- Segment 3: 55k → 182k (222 calls) → active
When auto-compact fires near 1M context, a single API call generates approximately 960k tokens of cache_creation. Under the 5-minute TTL regime, this repeats during normal operation.
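To put that spike in perspective, here is a rough cost-equivalent under the Sonnet prices above. The comparison against the estimated window size assumes, unverified, that cache writes count toward quota at full weight:

```python
# Sonnet 5-minute cache-write price from the pricing table (USD/MTok).
WRITE_5M = 3.75

# Auto-compact near the 1M limit re-writes ~960k tokens of context at once.
spike_tokens = 960_000
cost_equiv = spike_tokens / 1_000_000 * WRITE_5M
print(f"${cost_equiv:.2f} in cache-write cost from a single API call")  # $3.60

# Against cnighswonger's ~4.66M-token window estimate -- assuming (unverified)
# that cache_creation counts at full weight toward quota.
share = spike_tokens / 4_660_000
print(f"{share:.0%} of the estimated Max 5x window")  # 21%
```

Under that assumption, one auto-compact cycle would burn about a fifth of a Max 5x window in a single call, which is consistent with the “quota spike” framing above.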
Anthropic’s Response
Anthropic announced the following measures.
- Considering changing the default context window from 1M to 400k
- UX improvement to prompt `/clear` when reopening old sessions (shipped)
- `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude` available now for experimental 400k testing
- Bug fix in v2.1.90: an issue where starting a session after quota exhaustion triggered overage mode that locked onto 5-minute TTL
However, the Issue #45756 author’s case didn’t involve overage mode, so this fix doesn’t explain it. On the relationship between quota consumption and cache_read tokens, Anthropic said only that they would “follow up.”
Neither Issue mentioned any plans to disclose absolute quota values. While offering symptomatic fixes like context reduction, UX improvements, and bug patches, Anthropic hasn’t addressed the fundamental transparency issue of disclosing what’s inside the quota.
Third-Party Workarounds
For users who can’t wait for official fixes, the community has produced two workarounds.
The first is version pinning: pin to v2.1.81, before the regression, with `npm install -g @anthropic-ai/claude-code@2.1.81`.
The second is an interceptor: `claude-code-cache-fix` forces 1-hour TTL and corrects cache-fingerprint instability. Some Max 20x users have reported roughly 3-4x quota improvement.
Both are third-party tools outside Anthropic’s control. As with the OpenClaw case we previously covered, how these tools are treated under Anthropic’s policies deserves caution.
Third-party harnesses locked out of Claude Code subscriptions, the Issue documenting Claude Code quality regression, and now this silent TTL change and quota opacity. March-April 2026 has been a period of unannounced changes and information asymmetry.
A Hacker News commenter put it succinctly. “After getting rug-pulled, you have to verify every time whether something is wrong or it’s your own fault. That verification cost itself is a massive loss.”
The tools used in the investigation are available as open source: `cnighswonger/claude-code-cache-fix`’s `quota-analysis --source` mode lets you run the same analysis against your own `~/.claude/projects/` logs.