Tool-use API design for LLM agents: is_complete, retryable flags, and budget caps that stop loops
Claude Code consumed 1,673,680,266 tokens in five hours on July 21, 2025.
The anthropics/claude-code#4095 issue thread identifies four simultaneous factors: plan-execution loop, cache token explosion, recursive hook calls, and API error loops.
A DEV Community article on Tool-use API design for LLMs reads this class of incident not as “the model wasn’t smart enough” but as “tool boundaries were ambiguous.”
I previously covered killing idle API calls and setting timeouts in tmux-based Claude Code/Codex loops.
That was about stopping the agent from outside.
This article goes one layer deeper: making each tool response self-contained enough that the model knows what to do next without guessing.
Without a termination signal, models call tools again to confirm
The standard LLM agent loop looks like this:
flowchart TD
A[User request] --> B[LLM decides next action]
B --> C[Tool call]
C --> D[Tool result]
D --> E{Can the model read this as done?}
E -->|Yes| F[Reply to user]
E -->|No| B
The problem is when D is ambiguous.
If a search tool returns results: [] alone, the model cannot distinguish “search succeeded with zero matches” from “timed out and returned an empty array.”
A human would check the logs. An LLM calls the tool again.
The original article proposes self-describing responses: echo the query parameters, include total count, a completion flag, and a next-action hint.
The important fields are is_complete and next_action_hint.
{
"status": "success",
"query_summary": {
"destination": "Shibuya, Tokyo",
"check_in": "2026-07-12",
"check_out": "2026-07-15",
"guests": 1,
"max_price": 150
},
"results": [
{ "id": "h_1234", "name": "Hotel Granbell", "price": 128, "currency": "USD" },
{ "id": "h_5678", "name": "Shibuya Stream", "price": 142, "currency": "USD" }
],
"total_matches": 2,
"is_complete": true,
"next_action_hint": "Found 2 matches. Present them to the user. Do not search again unless parameters change."
}
next_action_hint feels slightly dirty — it embeds a model-directed instruction inside an API response.
But agent tool responses are not human-facing REST APIs.
When the caller is an LLM, returning conversational state alongside data fits the actual execution model.
In a token management guide for bloated CLAUDE.md files, I wrote that mapping tools one-to-one onto API endpoints produces too many tool definitions.
The same principle applies here.
Don’t expose raw APIs to LLMs. Wrap them into units where the model cannot get confused.
If you’re building an MCP server, make each tool return state, completion status, and retryability rather than being a thin proxy for an external endpoint.
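As a minimal sketch of such a wrapper in Python: the supplier client is a hypothetical stand-in passed in as a parameter, and the field names mirror the JSON above. The point is that the self-describing response is assembled inside the tool, not proxied through from the external API.

# Sketch of a tool that wraps an external search API and returns a
# self-describing response. `client` is a hypothetical supplier client;
# only the response shape matters here.
def search_hotels(client, destination: str, check_in: str, check_out: str,
                  guests: int = 1, max_price: float | None = None) -> dict:
    query = {
        "destination": destination,
        "check_in": check_in,
        "check_out": check_out,
        "guests": guests,
        "max_price": max_price,
    }
    results = client.search(**query)  # hypothetical external call
    return {
        "status": "success",
        "query_summary": query,          # echo the parameters back
        "results": results,
        "total_matches": len(results),
        "is_complete": True,
        "next_action_hint": (
            f"Found {len(results)} matches. Present them to the user. "
            "Do not search again unless parameters change."
            if results else
            "No hotels matched these parameters. Ask the user whether to "
            "relax the price or dates instead of retrying the same search."
        ),
    }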
Don’t return empty arrays and failures in the same shape
Silent failures are worse than crashes.
Swallowing exceptions and returning [] is already hard to debug in regular web apps.
For LLM agents it’s worse: the model interprets “no results” as “query too restrictive” and starts varying dates, locations, prices, and keywords across successive calls.
At minimum, separate success from failure in the response structure:
{
"status": "error",
"error_type": "timeout",
"error_message": "Supplier API did not respond within 5 seconds.",
"retryable": true,
"retry_after_ms": 2000,
"max_retries_remaining": 1
}
retryable: false carries significant weight.
Auth errors, permission denials, and logical input errors will never succeed with the same arguments.
If you make the model guess whether retrying is worthwhile, it almost always generates unnecessary retries.
This connects to the token exhaustion pattern I described in how AI coding tools break in three ways.
In sub-agent and auto-fix loops, the more ambiguous the failure signal, the more calls accumulate.
“It failed” isn’t enough.
The tool must return whether the failure is worth retrying.
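A sketch of how that decision can live in code rather than in the model's head. The exception classes and retry budget here are assumptions; the point is that retryability is classified by the tool layer, not guessed by the model.

# Sketch: translate exceptions into the structured error shape above,
# deciding retryability per failure class instead of leaving it ambiguous.
def to_tool_error(exc: Exception, retries_remaining: int) -> dict:
    if isinstance(exc, TimeoutError):
        return {
            "status": "error",
            "error_type": "timeout",
            "error_message": "Supplier API did not respond within 5 seconds.",
            "retryable": retries_remaining > 0,
            "retry_after_ms": 2000,
            "max_retries_remaining": retries_remaining,
        }
    if isinstance(exc, PermissionError):
        return {
            "status": "error",
            "error_type": "permission_denied",
            "error_message": str(exc),
            "retryable": False,  # same arguments will never succeed
        }
    return {
        "status": "error",
        "error_type": "unknown",
        "error_message": str(exc),
        "retryable": False,  # default to not retrying unclassified failures
    }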
Enforce budgets at the orchestrator, not in the prompt
The most practical pattern from the original article is orchestrator-level limits.
Hold max_tool_calls and max_total_cost_usd; when either is reached, disable tools and force the model to produce a final response.
This is a runtime constraint, not a request to the model.
Writing “don’t call tools more than necessary” in the prompt doesn’t work when the model is confused mid-session.
Call count, estimated cost, elapsed time, and consecutive same-tool invocations are better counted outside the LLM.
In the 1.67B token incident, the issue thread shows 436 requests at 0-second intervals, 224 requests/sec, and 436M tokens in one minute.
The session should have been killed long before reaching that scale, based on token growth rate or tool call frequency per session.
flowchart TD
A[LLM response] --> B{Contains tool_calls?}
B -->|No| C[Final response]
B -->|Yes| D{Within call budget?}
D -->|Yes| E[Execute tool]
E --> F[Append to history]
F --> A
D -->|Over| G[Force final response with tools disabled]
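A minimal sketch of that loop in Python. llm.complete(), estimate_cost(), run_tool(), and TOOL_DEFINITIONS are stand-ins for whatever model client and tool dispatcher you use; the budget checks are the part to copy.

MAX_TOOL_CALLS = 25
MAX_TOTAL_COST_USD = 2.00

def run_agent(messages: list[dict]) -> str:
    tool_calls_made = 0
    cost_spent_usd = 0.0
    while True:
        over_budget = (tool_calls_made >= MAX_TOOL_CALLS
                       or cost_spent_usd >= MAX_TOTAL_COST_USD)
        # Over budget: send no tool definitions, so the model can only answer.
        response = llm.complete(messages,
                                tools=None if over_budget else TOOL_DEFINITIONS)
        cost_spent_usd += estimate_cost(response)   # stand-in cost estimate
        messages.append(response.as_message())      # keep assistant turn in history
        if not response.tool_calls:
            return response.text                    # final answer, loop ends
        for call in response.tool_calls:
            tool_calls_made += 1
            result = run_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})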
If you’re building your own agent framework, add this early.
Before logging dashboards, before observability.
An agent without limits doesn’t “stop” on failure — it keeps spending.
Block duplicate calls and surface the previous result
When the same tool is called with identical arguments twice, there’s no need to hit the external API again.
The original article hashes the last 5 tool calls and returns duplicate_call_blocked on signature match.
In production, roughly 8% of all tool calls were unnecessary duplicates.
What to return here is not a mechanical HTTP 409 but an explanation the LLM can recover from:
{
"status": "duplicate_call_blocked",
"message": "This exact search_hotels call was already made earlier in this conversation. Use the previous result instead of calling again.",
"retryable": false
}
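A sketch of the guard itself, using a 5-call window as in the original article; the hashing scheme is otherwise an assumption.

# Sketch: hash tool name + canonicalized arguments and block exact repeats
# within the last 5 calls.
import hashlib
import json
from collections import deque

_recent_calls: deque[str] = deque(maxlen=5)

def call_signature(tool_name: str, arguments: dict) -> str:
    canonical = json.dumps(arguments, sort_keys=True, default=str)
    return hashlib.sha256(f"{tool_name}:{canonical}".encode()).hexdigest()

def check_duplicate(tool_name: str, arguments: dict) -> dict | None:
    sig = call_signature(tool_name, arguments)
    if sig in _recent_calls:
        return {
            "status": "duplicate_call_blocked",
            "message": (f"This exact {tool_name} call was already made earlier "
                        "in this conversation. Use the previous result instead "
                        "of calling again."),
            "retryable": False,
        }
    _recent_calls.append(sig)
    return None  # not a duplicate; proceed with the real call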
Explicitly stating “the previous result is already in your context” works.
Models do miss their own recent outputs and tool results.
In long sessions, patterns emerge: reading the same file repeatedly, running the same test again, hitting the same API with slightly varied arguments.
In the article analyzing Claude Code quality regression from 234,760 tool calls, I looked at Read:Edit ratios and spurious stop phrases.
That was about model-side thinking depth.
The design here is a different layer of defense: even if the model thinks shallowly and loops back to the same operation, the execution layer can respond “you already did this.”
Put boundary validation right before the tool, not before the LLM
Even when you give the LLM a correct JSON schema, invalid values still show up at runtime.
Dates in the past, checkout before check-in, guest count out of range, numbers as strings — these happen regularly.
If invalid requests reach the external API, the supplier’s cryptic error comes back and the model starts guessing at corrections.
Validate at the boundary:
{
"status": "validation_error",
"errors": [
{
"field": "check_out",
"message": "check_out must be after check_in"
}
],
"user_facing_hint": "Confirm the checkout date with the user before retrying.",
"retryable_after_correction": true
}
Pydantic or Zod validators align the tool definition schema with runtime checks.
But don’t stop at emitting a schema.
When the LLM violates the schema, make the error message LLM-readable too.
This diverges from conventional API design.
For human-facing APIs, a 422 with field errors is usually sufficient.
For LLM-facing tools, the response should state whether the model should ask the user for clarification or can self-correct and retry.
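A sketch with Pydantic, reusing the hotel-search fields from earlier; the hint wording and helper name are illustrative.

# Sketch: validate arguments at the tool boundary with Pydantic and turn
# schema violations into the LLM-readable shape above.
from datetime import date
from pydantic import BaseModel, ValidationError, model_validator

class HotelSearchArgs(BaseModel):
    destination: str
    check_in: date
    check_out: date
    guests: int = 1

    @model_validator(mode="after")
    def check_date_order(self):
        if self.check_out <= self.check_in:
            raise ValueError("check_out must be after check_in")
        return self

def validate_args(raw_args: dict) -> dict | None:
    try:
        HotelSearchArgs(**raw_args)
        return None  # arguments are valid; go ahead and call the external API
    except ValidationError as exc:
        return {
            "status": "validation_error",
            "errors": [
                {"field": ".".join(map(str, e["loc"])) or "request",
                 "message": e["msg"]}
                for e in exc.errors()
            ],
            "user_facing_hint": "Confirm the dates with the user before retrying.",
            "retryable_after_correction": True,
        }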
All of this is about broken context memory
Every pattern above shares the same root.
is_complete is needed because the model can’t judge “I already have enough information.”
retryable: false is needed because the model doesn’t retain “this failed with these exact conditions” into the next turn.
duplicate_call_blocked is needed because the model doesn’t reference “I did this 3 turns ago.”
Just because something is written into the LLM’s context window doesn’t mean it’s actually being referenced.
Transformer attention fades with positional distance, and in long sessions the model stops attending to tool results in the middle of the window.
Even with 128k tokens of context, the model’s effective working memory is far narrower.
A human analogy: 100 documents spread across a desk, but only the top 3 pages are being looked at.
Returning is_complete and next_action_hint in tool results means “put this information on those top 3 pages.”
It compensates for the model’s skimming from the result side.
In the 1.67B token incident, as the session grew, earlier tool results and plan state fell out of the model’s effective attention, and the same operations were executed over and over.
Budget caps and duplicate detection cut that loop externally, but the root cause is “the model doesn’t remember what just succeeded.”
When designing MCP servers, this perspective changes decisions.
The criterion for how much data to return shifts from “accuracy” to “can the model judge the situation from the latest turn alone?”
If one tool result is sufficient to understand the current state, the model won’t dig through history.
If the result is fragmented and requires cross-referencing with a response from 3 turns ago, the model calls again to confirm.
The patterns in the original article are framed as “loop prevention” and “silent failure prevention,” but what they protect is the fragility of LLM short-term memory.
Context compaction erases tool results
In long sessions, agent frameworks compress context.
Claude Code runs compaction when conversation history exceeds a threshold, replacing old turn details with summaries.
Tool results lose the most information in this process.
File contents, API response JSON, test output — all collapse to “read a file” or “called an API” in the summary.
After compaction, the model can only reference the summary text and the last few turns.
Whether a search 3 turns ago returned results: [], two hits, or an error is unrecoverable from the summary.
So the model calls the same tool again.
If is_complete: true or retryable: false was in the original tool result, that fact survives summarization more easily.
“Search complete, 2 results obtained, no re-search needed” retains meaning when summarized.
“results: [...]” loses its content entirely.
Designing tool results to be self-contained resists not just attention fade but also compaction.
If each tool result contains everything needed to assess the situation, compaction doesn’t destroy decision-making inputs.
Agent memory tackles the same problem across sessions
Systems like MemGPT and Letta place persistent memory outside the context window.
Even after a session ends, they retain where things left off, which APIs failed, what the user prefers.
RAG is the same direction — retrieve relevant information on demand instead of loading everything into context.
Tool response design addresses the same problem at a different layer.
Memory systems handle inter-session recall. Tool response design handles intra-session short-term memory.
Having one doesn’t eliminate the need for the other.
Even if persistent memory records “the last hotel search found 2 results,” it can’t prevent the model from repeating that search within the current session.
What prevents that is is_complete in the tool result and duplicate detection — not long-term memory.
Conversely, no amount of careful tool response design prevents cross-session repetition.
Stopping “running the same search today that I ran yesterday” requires inter-session memory.
Tool design and memory systems divide the memory problem between short-term and long-term.
Stateless tool responses contradict agent architecture
REST API design principles are stateless.
Each request is independent; the server holds no client state.
But in LLM agent tools, the caller is the side that “can’t hold state.”
The model sees the previous response but may not effectively see one from 5 turns ago.
So tool responses should be designed statefully.
“The current situation is X, progress is Y, decide whether to do Z next” — return this every time.
It violates REST aesthetics, but when the caller is an LLM rather than a human HTTP client, the stateless assumption doesn’t hold.
The same applies in MCP tool definitions.
The MCP server maintains session state internally and includes “current progress” in each tool result.
The model can make its next decision from the latest tool result alone.
When there’s no need to scroll back through earlier tool results, loops decrease.
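As a sketch, a stateful result for a multi-step booking flow might look like this; the field names are illustrative, not a fixed schema.

{
  "status": "success",
  "current_step": "hotel_selected",
  "completed_steps": ["search_hotels", "select_hotel"],
  "remaining_steps": ["collect_guest_details", "confirm_payment"],
  "session_note": "Hotel h_1234 is selected for 2026-07-12 to 2026-07-15. The next tool to call is collect_guest_details.",
  "is_complete": false
}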
Tool call records dynamically rewrite retryable
Both is_complete and retryable are static in the current design.
The tool developer decides “this error is retryable” or “this result is complete” and hardcodes it.
But in practice, whether something is retryable depends on call context.
Take an external API rate limit error.
Returning retryable: true with retry_after_ms is correct.
But if the same session has hit the rate limit 3 times in a row, the 4th attempt should probably return retryable: false.
A static flag can’t express this.
Accumulating tool call records enables dynamic decisions.
Look not just at in-session call history but at success rates from past sessions, resolution rates after retry, and recurrence patterns for identical errors.
“This API’s timeout resolves on retry 12% of the time.” “This validation_error pattern has 0 instances of the model self-correcting.” With records like these, retryable becomes a classifier output rather than a hardcoded value.
The classifier here doesn’t need to be a complex ML model.
Extract features from the last N tool call logs and predict retry success rate.
Logistic regression or XGBoost is sufficient.
Effective features: error type, consecutive same-tool call count, session elapsed time, previous result status.
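A sketch of that feature extraction with scikit-learn. The call-log fields (tool, status, error_type, session_elapsed_s) and the 0.2 threshold are assumptions; the model is fitted offline on past sessions.

# Sketch: predict whether retrying a failed call is worth it, from features
# of the failure and the in-session call history.
from sklearn.linear_model import LogisticRegression

def extract_features(call_log: list[dict], failure: dict) -> list[float]:
    # call_log: prior tool calls in this session, oldest first.
    # failure: the failed call being considered for retry.
    same_tool_streak = 0
    for entry in reversed(call_log):
        if entry["tool"] == failure["tool"]:
            same_tool_streak += 1
        else:
            break
    return [
        float(failure["error_type"] == "timeout"),
        float(failure["error_type"] == "rate_limit"),
        float(same_tool_streak),
        float(failure["session_elapsed_s"]),
        float(call_log[-1]["status"] == "error") if call_log else 0.0,
    ]

def predict_retryable(model: LogisticRegression,
                      call_log: list[dict], failure: dict) -> bool:
    # model is trained offline: X = features at failure time,
    # y = whether a retry with the same arguments eventually succeeded.
    p_success = model.predict_proba([extract_features(call_log, failure)])[0][1]
    return p_success > 0.2  # threshold is a tuning choice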
The same approach extends to loop detection.
The duplicate detection described earlier uses exact hash matching on tool name and arguments.
But LLM loops are often not exact duplicates.
The model varies search keywords slightly, shifts dates, changes counts — “subtly different same operations” in sequence.
A classifier can detect “loop indicators” from argument similarity and in-session call patterns.
Exact duplicate detection fits in an if-statement; fuzzy loop detection is better suited to a classifier.
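Even before reaching for a classifier, a similarity heuristic already catches what exact hashing misses. A sketch using difflib; the 0.8 threshold and the three-hit rule are assumptions, and a learned model could replace them.

# Sketch: flag a likely loop when several recent calls to the same tool have
# near-identical arguments, even if no two are exact duplicates.
import json
from difflib import SequenceMatcher

def looks_like_loop(recent_calls: list[dict], tool_name: str, arguments: dict,
                    threshold: float = 0.8, min_hits: int = 3) -> bool:
    new_repr = json.dumps(arguments, sort_keys=True, default=str)
    near_misses = 0
    for call in recent_calls:
        if call["tool"] != tool_name:
            continue
        old_repr = json.dumps(call["arguments"], sort_keys=True, default=str)
        if SequenceMatcher(None, old_repr, new_repr).ratio() >= threshold:
            near_misses += 1
    return near_misses >= min_hits  # several near-identical calls in a row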
These long-term records differ from the agent memory discussed earlier.
Agent memory holds the model’s working state.
Tool call records are operational data on the infrastructure side.
They’re not directly referenced by the model — they’re backing information for the orchestrator to make retryable and duplicate decisions.
As the classifier learns across sessions, per-tool “retry success profiles” accumulate.
An external hotel search API’s timeout resolves on retry 60% of the time after 5 seconds, but a payment API’s auth error resolves 0% of the time.
Rather than branching this in static code, default retryable values are set automatically from operational data.
A tool developer’s rough “probably retryable” flag gets calibrated by production evidence.
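A sketch of that calibration, assuming logs that record whether a retry eventually resolved each error; the log schema and the 10% cutoff are assumptions.

# Sketch: derive default retryable values per (tool, error_type) from
# operational logs of past failures and their retry outcomes.
from collections import defaultdict

def retry_profiles(call_log: list[dict], min_rate: float = 0.1) -> dict:
    outcomes: dict[tuple, list[bool]] = defaultdict(list)
    for entry in call_log:
        if entry["status"] == "error":
            outcomes[(entry["tool"], entry["error_type"])].append(entry["retry_resolved"])
    return {
        key: {
            "retry_success_rate": sum(v) / len(v),
            "default_retryable": sum(v) / len(v) >= min_rate,
        }
        for key, v in outcomes.items()
    }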
The same data serves as a quality signal for tool design.
If a specific tool has abnormally high retry rates or loop incidence, its response design has a problem.
Missing is_complete, ambiguous error messages preventing model judgment, or weak validation letting invalid requests reach external APIs.
Long-term records feed back not just into orchestrator runtime decisions but into tool redesign.
The structure is the same as human-facing APM identifying bottlenecks from response time and error rate.
For agents, retry rate, loop rate, and duplicate rate are the metrics that reveal tool design bottlenecks.
The difference: observations don’t go to a dashboard — they flow through a classifier directly into the orchestrator’s decision logic.