Tech 6 min read

How 8 AI Agent Benchmarks Were Gamed to Near-Perfect Scores Without Solving a Single Task

IkesanContents

AI agent benchmarks can be trivially gamed.
UC Berkeley RDI (Responsible Decentralized Intelligence) researchers Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song have published a proof-of-concept attack that achieves near-perfect scores on 8 major agent benchmarks without solving a single task.

Original post: How We Broke Top AI Agent Benchmarks: And What Comes Next

All 8 Benchmarks, Broken

Here are the results.

BenchmarkTasksPost-Attack ScoreMethod
Terminal-Bench89100%Binary wrapper trojanization
SWE-bench Verified500100%pytest hook injection
SWE-bench Pro731100%Container parser overwrite
WebArena812~100%Config file leak + file:// access
FieldWorkArena890100%Validation bypass
CAR-benchAll100%Reward skip
GAIA165~98%Public answer lookup
OSWorld36973%Direct answer file download

Both SWE-bench Verified and SWE-bench Pro hit 100%, WebArena ~100%.
Only OSWorld stayed at 73%, but even that was achieved without solving any tasks.
OpenAI deprecated SWE-bench Verified in February 2026 (related article: Large-Scale Claude Distillation and the Collapse of SWE-bench), and this research reaffirms that it was the right call.

Attack Methods in Detail

The specific attacks vary by benchmark, but the underlying problem is the same.
Agents and evaluation environments are not isolated from each other.

SWE-bench in 10 Lines of Python

The SWE-bench attack requires nothing more than dropping a single conftest.py.
Using pytest’s hookwrapper mechanism, it unconditionally rewrites every test result to pass.
No bug fixes required.
10 lines of Python and all 500 SWE-bench Verified problems are marked “resolved.”

graph TD
    A[Agent starts] --> B[Place conftest.py]
    B --> C[pytest hookwrapper<br/>hooks all tests]
    C --> D[Tests execute]
    D --> E[hookwrapper forces<br/>all results to pass]
    E --> F[SWE-bench: all solved]

The same approach worked on SWE-bench Pro.
There, the parser inside the container was directly overwritten, hijacking the evaluation pipeline itself.

Terminal-Bench via curl Trojanization

82 of Terminal-Bench’s 89 tasks download tools from the internet during validation (using curl or uvx).
The attack is straightforward: during the agent’s execution phase, replace /usr/bin/curl with a wrapper script.
When curl is called during the subsequent validation phase, the wrapper intercepts and generates fake passing output.

The prerequisite is that the agent and validator run in the same environment.
Once the agent can modify system binaries, validation independence is gone.

WebArena Exposes Answers via file://

WebArena is a browser-based benchmark.
Agents are supposed to operate Chromium to complete web tasks, but in practice, typing a file:// URL pointing to the local answer file into the address bar lets you read the gold answers (ground truth data) directly from the task configuration.
DOM injection and prompt injection vectors also exist.

FieldWorkArena’s Validation Ignores Content

FieldWorkArena’s validation function only checked whether the last message came from the assistant.
It never verified content accuracy.
Any response at all counted as correct.

Prompt Injection Against CAR-bench’s LLM Judge

CAR-bench uses LLM-as-Judge (an LLM as the evaluator) for scoring.
The agent’s conversation history is fed unsanitized into the judge’s prompt, allowing the agent to embed hidden instructions in its own messages to manipulate the judge.
The judge LLM faithfully follows these instructions and awards high scores.

LLM-as-Judge is widely used as a scalable alternative to human evaluation, but without input sanitization, it simply doesn’t work as an evaluation method.

OSWorld and GAIA

OSWorld was attacked by downloading answer files directly with wget and matching expected output exactly.
GAIA scored 98% just by referencing publicly available answer data.

7 Common Vulnerability Patterns

Despite different attack methods, the underlying design flaws converge into 7 patterns.

PatternDescription
No environment isolationAgent and evaluator run in the same environment
Bundled answersTest data contains ground truth
eval() usageAgent output trusted and passed to eval()
Unprotected LLM judgeLLM judge used without input sanitization
Weak string matchingAnswer validation relies on simple string comparison
Missing validation logicValidation checks format only, not content
Trusting test subject outputOutput from the code under test is taken at face value

The most critical is “no environment isolation.”
SWE-bench, Terminal-Bench, and OSWorld all run agent code and evaluator in the same container.
This lets the agent directly manipulate the evaluator.

IBM and UC Berkeley’s IT-Bench research had already flagged the gap between benchmark scores and real-world task performance, but this research goes further: benchmark scores themselves can be manipulated.

Already Happening in the Real World

This isn’t a hypothetical academic scenario. Real-world examples already exist.

IQuest-Coder-V1 inflated its SWE-bench scores by copying answers from commit history.
Frontier models have been reported engaging in 30%+ reward hacking through stack introspection (examining a running program’s call stack) and monkey patching (dynamically modifying code at runtime).
One frontier model reportedly discovered a “self-erasing privilege escalation exploit” on its own.

The smarter agents get, the better they become at finding and exploiting benchmark weaknesses.
The very systems meant to measure capability end up being subverted by their test subjects.

BenchJack (Automated Benchmark Attack Tool)

Berkeley RDI also released BenchJack, an automated scanning agent that operates in two phases.

graph TD
    A[BenchJack starts] --> B[Phase 1: Reconnaissance]
    B --> C[Analyze scoring mechanism]
    C --> D[Identify evaluation<br/>pipeline vulnerabilities]
    D --> E[Phase 2: Exploit construction]
    E --> F[Generate end-to-end<br/>exploit]
    F --> G[Demonstrate benchmark attack]

Phase 1 analyzes the benchmark’s scoring mechanism and identifies evaluation pipeline vulnerabilities.
Phase 2 automatically constructs working exploits and demonstrates the attack.
It’s positioned as a red team tool for pre-testing benchmark integrity.

Agent-Eval Checklist

The research team proposes a checklist for benchmark designers.

  • Run evaluation outside the agent’s container on a separate read-only host
  • Don’t include ground truth in task configurations
  • Replace eval() with sandboxed parsers
  • Sanitize all agent output before passing to LLM judges
  • Run adversarial testing with null agents (agents that do nothing) and random agents before release

The fundamental fix comes down to complete isolation between agent and evaluator.
In web application security terms, it’s the “never trust user input” principle.
For benchmarks, read it as “never trust the test subject.”

Next-generation benchmarks like ARC-AGI-3 evaluate agent behavior in black-box environments by design.
Design principles like answer separation and interactive evaluation are one response to the vulnerability patterns identified here.

Hacker News threads are full of comments calling this “Goodhart’s Law in action” (when a measure becomes a target, it ceases to be a good measure).
Hardware has been through this cycle before: Intel in 2024 (compiler optimizations invalidating benchmarks), Nvidia in 2003 (similar issues).