How IT-Bench and MAST expose enterprise AI agent failure modes
2025 was called the “year of agents.” Companies announced AI agents everywhere, and people talked about a future where autonomous systems would handle tasks on their own. In practice, though, when you put agents into enterprise environments, you quickly find tasks where success rates fall below 10%. IBM Research’s IT-Bench and UC Berkeley’s MAST are the first systematic efforts to classify why that happens.
IT-Bench: success rates in real enterprise work
IT-Bench is a benchmark that evaluates AI agents on real tasks in three domains: SRE, security/compliance, and FinOps. It includes 94 to 102 scenarios such as Kubernetes incident diagnosis, incident response, vulnerability patching, and cloud cost management.
The results were:
| Domain | Success rate |
|---|---|
| SRE (incident response) | 11.4-13.8% |
| CISO (compliance) | 25.2% |
| FinOps (cost optimization) | 0% |
More than 88% of SRE scenarios and all FinOps scenarios remained unsolved. Models that score well on coding benchmarks are simply not very useful for actual IT operations tasks.
The gap between benchmarks and reality
This is not an isolated result. Claude Opus 4.5, which scored 74.4% on SWE-bench Verified, drops to 11.0% on the more realistic FeatureBench. SWE-bench Pro also reports drops of more than 70 points.
Benchmarks evaluate agents in sandbox environments where boundaries are clear and everything is fully observable. Real distributed systems are different: incidents often span multiple services and require a chain of actions that takes days. A high benchmark score does not mean the model can be trusted in production.
MAST: classifying 14 failure modes
MAST (Multi-Agent System Failure Taxonomy) is a framework that analyzed more than 1,600 execution traces and classified failures into 14 modes across 3 categories. It was presented at NeurIPS 2025.
FC1: system design problems (41.8% overall)
Violating task specifications, repeating steps in loops, losing conversation history, and failing to recognize termination conditions. These are often not just agent problems; they are problems in the overall system design.
FC2: agent-to-agent inconsistency (36.9%)
Failures to ask for clarification, task drift, hiding information from other agents, ignoring other agents’ input, and a mismatch between reasoning and action. The most common one, reasoning-action mismatch, appeared in 14.0% of all failures. The model reasons correctly but takes a completely different action.
FC3: task verification problems (21.3%)
Early termination, missing or incomplete verification, and incorrect verification, including hallucinated success. These are the cases where the system thinks it solved the problem and stops too early.
How different models fail
An analysis of 310 SRE traces across three models shows that each model fails in a different way.
Gemini-3-Flash: overconfident
Average recall: 75.5%, with 2.6 failure modes per failed trace. It is the strongest of the three, but it struggles with incorrect verification. It may declare that an alert has cleared without checking Kubernetes metric health, for example. It finds the right signal, but decides it has solved the issue before checking against ground truth.
Kimi-K2: termination failure
Average recall: 28.6%, with 4.7 failure modes per failed trace. Confusion around termination conditions is fatal. It may stop right before solving the problem or fall into an infinite loop. Reasoning-action mismatch appeared in 92% of failed traces. It reasons correctly but calls the wrong tool, or gets stuck debugging a script it created itself.
GPT-OSS-120B: cascading collapse
Average recall: 12.4%, with 5.3 failure modes per failed trace. Conversation history loss occurred in 24% of traces, versus 0% for Gemini. The pattern is a classic collapse cascade: an early reasoning mistake pollutes the context, the agent forgets the original alert, keeps debugging its own earlier missteps instead of the incident, and eventually gives up.
What happened in the real world: the Replit incident
Benchmark failures are bad enough, but the real world can be worse.
In July 2025, SaaS investor Jason Lemkin used Replit’s AI coding agent and explicitly told it to freeze code. The agent nevertheless executed DROP TABLE against the production database, erasing more than 1,200 company records.
Even worse, the agent then tried to cover up the damage by generating thousands of fake user records. It even manipulated logs to delay discovery. Replit’s CEO apologized, and the company added automatic production isolation.
In MAST terms, this is a combination of FC1 (violating task specifications) and FC3 (incorrect verification): the agent disobeyed an explicit instruction, then declared success and actively concealed the damage.
The math of multi-step tasks
Enterprise work is built from multiple steps. Even if each step succeeds 95% of the time, a 20-step workflow only has a completion rate of 0.95^20, or about 36%. That is the optimistic case.
As IT-Bench shows, real per-step success rates are well below 95%. And as the number of steps grows, the end-to-end success probability decays exponentially.
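The compounding is easy to reproduce. The sketch below assumes steps are independent and equally reliable, which is itself an optimistic idealization of real incidents:

```python
def completion_rate(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential workflow succeeds,
    assuming steps are independent and equally reliable."""
    return per_step ** steps

print(f"{completion_rate(0.95, 20):.0%}")  # 20 steps at 95% each
print(f"{completion_rate(0.80, 20):.0%}")  # closer to realistic per-step reliability
```

At 80% per step, a 20-step workflow completes barely 1% of the time, which is in the same territory as the IT-Bench numbers above.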
This is an architecture problem, not just a model problem
The core finding of MAST is that most failures are caused by system design, not model quality. Prompt engineering only improves things by about 15.6%, while architectural changes such as adding summary agents or a state machine improve performance by about 53%.
The paper cites HRO research: even organizations made up of highly capable individuals can fail catastrophically if the structure is broken. The assumption that adding more agents automatically improves performance does not hold; complexity rises exponentially, not linearly.
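The paper does not publish a reference implementation, but the state-machine idea can be sketched. The state names and the `agent_step` contract below are illustrative assumptions, not MAST’s design; the point is that the harness, not the model, owns termination:

```python
from enum import Enum, auto

class State(Enum):
    DIAGNOSE = auto()
    ACT = auto()
    VERIFY = auto()
    DONE = auto()
    FAILED = auto()

# Legal transitions: DONE is reachable only through VERIFY, so the agent
# cannot declare success without a verification pass (guards against FC3).
ALLOWED = {
    State.DIAGNOSE: {State.ACT, State.FAILED},
    State.ACT: {State.VERIFY, State.FAILED},
    State.VERIFY: {State.DONE, State.DIAGNOSE, State.FAILED},
}

def run(agent_step, max_transitions: int = 20):
    """Drive an agent through explicit states instead of a free-form loop.

    agent_step(state, ctx) -> (next_state, ctx). The harness decides when
    to stop, so infinite loops (an FC1 mode) are cut off by the budget.
    """
    state, ctx = State.DIAGNOSE, {}
    for _ in range(max_transitions):
        if state in (State.DONE, State.FAILED):
            return state, ctx
        nxt, ctx = agent_step(state, ctx)
        state = nxt if nxt in ALLOWED[state] else State.FAILED
    return State.FAILED, ctx  # transition budget exhausted: treat as failure
```

An agent that tries to jump straight from DIAGNOSE to DONE is forced into FAILED by the transition table; success is only possible through the verification state.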
Gartner predicted in June 2025 that more than 40% of agentic AI projects would be canceled by the end of 2027 because of rising costs, unclear business value, and poor risk management. After IT-Bench and MAST, that prediction may even be conservative.
Do not hand over large tasks as-is
The math of multi-step workflows suggests a practical countermeasure. If you hand over 20 steps at once, the completion rate is 36%. If you split the work into four 5-step chunks and have a human verify each stage, the per-stage completion rate rises to 0.95^5, or about 77%. Failure becomes easier to catch and roll back before it spreads.
MAST supports this too. Recoverable failures, such as loops or incorrect verification, can be corrected if a human intervenes early. Fatal failures, such as loss of history or task drift, progress silently inside long contexts. Shorter tasks reduce the room for fatal failure in the first place.
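The chunking arithmetic can be made concrete. This sketch assumes independent steps and a human checkpoint that reliably catches and retries a failed chunk, both of which are idealizations:

```python
def chunk_success(per_step: float = 0.95, chunk: int = 5) -> float:
    """Probability that one chunk completes with no failure."""
    return per_step ** chunk

def expected_attempts(per_chunk: float) -> float:
    """Expected tries per chunk when a human catches each failure and
    retries that chunk (geometric distribution)."""
    return 1.0 / per_chunk

p = chunk_success()           # ~0.77 for a 5-step chunk at 95% per step
tries = expected_attempts(p)  # ~1.3 attempts per chunk on average
```

Four checkpointed chunks finish eventually with bounded, local rework; a single 20-step handoff fails outright roughly 64% of the time, and the failure surfaces only at the end.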