ARC-AGI-3 announced: frontier AI scores under 1% on interactive reasoning
On March 24, 2026, François Chollet and colleagues released the interactive reasoning benchmark “ARC-AGI-3”. The technical report, the evaluation harness (SDK), and the leaderboard were released at the same time, and the ARC-AGI-3 track of the ARC Prize 2026 competition has also opened.
What ARC-AGI-3 is trying to measure
ARC-AGI-1 and 2 were static puzzles: the solver inferred transformation rules by looking at grid input-output pairs. ARC-AGI-3 is a major departure, an “Interactive Reasoning Benchmark” (IRB) in which an agent infers the rules while actually interacting with an environment.
Specifically, agents are dropped into a turn-based game environment. The goal is not disclosed, there are no instructions, and even the victory conditions are withheld. Agents must act, receive feedback, build a model of the environment, and discover the goal on their own.
The design principle is simple: create tasks that humans understand easily but that AI without prior knowledge cannot solve. No numbers, letters, or cultural symbols are used. Tasks are purely visual, consisting of nothing but a 64×64 grid and 16 colors, and are designed to rule out the “pattern memory” that language models can mine from natural language.
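To make this concrete, an observation of this kind can be modeled as a 64×64 array of color indices 0-15. This is only an illustrative encoding, not the official SDK format:

```python
import random

GRID_SIZE = 64   # screens are 64x64 cells
NUM_COLORS = 16  # each cell holds one of 16 color indices (0-15)

random.seed(0)
# One illustrative observation frame: rows of color indices.
observation = [[random.randrange(NUM_COLORS) for _ in range(GRID_SIZE)]
               for _ in range(GRID_SIZE)]

# Everything an agent learns must come from frames like this:
# no text, no labels, no instructions.
distinct_colors = {cell for row in observation for cell in row}
print(len(observation), len(observation[0]), len(distinct_colors))
```

Anything beyond raw color indices (objects, goals, rules) is something the agent has to infer for itself.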
The probability that a random agent clears a level is designed to be very small, less than 1 in 10,000 (the quoted figure for the level LS20 is 1/355), so a brute-force approach will not work.
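As a back-of-the-envelope check on why brute force fails, here is the chance that a purely random agent clears at least once in n independent attempts, using the quoted 1/355 figure for LS20 as an assumed per-attempt probability (the independence assumption is mine):

```python
def p_at_least_one_clear(p: float, n: int) -> float:
    """Probability of clearing at least once in n independent random attempts."""
    return 1.0 - (1.0 - p) ** n

p = 1 / 355  # quoted random-clear figure for LS20, treated as i.i.d. here
for n in (1, 10, 100):
    print(f"{n:>3} attempts: {p_at_least_one_clear(p, n):.4f}")
```

Even 100 random attempts clear such a level well under a third of the time, and at the stated 1-in-10,000 design target the odds are far worse.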
Frontier LLM scores
As of March 2026, every major frontier model scores below 1% on the semi-private leaderboard.
| Provider | Model | Score |
|---|---|---|
| Google | Gemini 3.1 Pro Preview | 0.37% |
| OpenAI | GPT 5.4 (High) | 0.26% |
| Anthropic | Opus 4.6 (Max) | 0.25% |
| xAI | Grok-4.20 (Beta 0309 Reasoning) | 0.00% |
In the preview competition held in July-August 2025, the top entries combined CNNs with reinforcement learning rather than relying on frontier LLMs.
| Entry | Approach | Score |
|---|---|---|
| StochasticGoose (Tufa Labs) | CNN + Reinforcement Learning | 12.58% |
| Blind Squirrel | State graph exploration | 6.71% |
With 486 human participants, it was confirmed that humans could complete 100% of the tasks.
The RHAE evaluation metric
Evaluation uses a metric called RHAE (Relative Human Action Efficiency).

For each level, the AI’s action count is compared against a human baseline (the action count of the second-best human), and the efficiency ratio is squared to produce the level score. The five level scores are then averaged with linearly increasing weights, so later levels count more.

If the AI clears a level in the same number of actions as the human baseline, the level scores the maximum of 1.0. If the AI needs more than 5 times as many actions as the human, the level is cut off and scored 0.

The intent is to measure not just whether the agent eventually succeeded, but how efficiently it learned. On a task that a human clears in 10 moves and an AI clears in 1,000, the scores differ sharply even though both ultimately got it right.
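The scoring described above can be sketched as follows. This is my reading of the metric, not the official implementation: the per-level score is taken as (human baseline ÷ AI actions)², zeroed past 5× the human baseline, with levels averaged under linearly increasing weights:

```python
def level_score(ai_actions: int, human_baseline: int) -> float:
    """One reading of the per-level RHAE score (not the official code):
    squared efficiency ratio, cut off to 0 past 5x the human baseline."""
    if ai_actions > 5 * human_baseline:
        return 0.0
    return min(1.0, (human_baseline / ai_actions) ** 2)

def rhae(level_scores: list[float]) -> float:
    """Average the level scores with linearly increasing weights,
    so later levels count more."""
    weights = range(1, len(level_scores) + 1)  # 1, 2, ..., 5
    return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)

print(level_score(10, 10))    # matches the human baseline -> 1.0
print(level_score(1000, 10))  # 100x the human actions -> cut off to 0.0
print(rhae([1.0, 1.0, 0.5, 0.25, 0.0]))
```

Note how the cutoff makes the 1,000-moves-versus-10 example score exactly zero despite the eventual clear.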
The ARC-AGI-2 saturation problem and its countermeasure
Frontier models such as Gemini 3 posted high scores on ARC-AGI-2 in 2025, but the report says there is evidence of contamination in the training data. Among the reasons cited: Gemini 3 used the integer-to-color mapping accurately without being given it.
ARC-AGI-3 aims to neutralize the “memorize question-answer pairs” approach by making the environment interactive. Evaluating an agent that acts in real time in an environment that is randomly generated on every run demands on-the-spot reasoning rather than memory replay.
Datasets and resources
| Dataset | Number of environments |
|---|---|
| Public Demo | 25 |
| Semi-Private | 55 |
| Fully Private | 55 |
The human data collected for the technical report covers 486 unique participants, 414 candidate environments, and 2,893 attempts. The median attempt took 7.4 minutes, and the median successful attempt took 8.1 minutes. Participants were paid a fixed reward (5 per clear).
In ARC Prize 2026, two tracks, ARC-AGI-2 (in its final year) and ARC-AGI-3, will run on Kaggle. The total prize pool is $2 million.
The SDK, the evaluation harness, and the 25 Public Demo environments are available at arcprize.org/arc-agi/3.
What exactly is the test?
For those who have read this far and are wondering, “Grid puzzles? Interactive reasoning? What is this actually about?”, here is a slightly more detailed explanation.
ARC-AGI-1 and 2: a “can you spot the rule?” test
First, the predecessors, ARC-AGI-1 and 2. These were pattern-reasoning tests built from colored grids.
For example, a problem might look like this:
| | Input | Output |
|---|---|---|
| Example 1 | One red square in the upper left | One red square in the lower right |
| Example 2 | One blue square in the upper left | One blue square in the lower right |
| Actual | One green square in the upper left | ? |
The answer: one green square in the lower right. The rule: the square jumps to the diagonally opposite position.
Real problems are more complex, but the task is the same: look at a few input-output examples and infer the hidden rule. It is similar to the shape questions on IQ tests.
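The toy task above fits in a few lines of code. The grid encoding (0 = empty, 1 = red, 2 = blue, 3 = green) and the function name are mine, for illustration only:

```python
def apply_rule(grid):
    """The inferred rule: every colored cell jumps to the
    diagonally opposite position in the grid."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if grid[r][c]:
                out[n - 1 - r][n - 1 - c] = grid[r][c]
    return out

# One green square (3) in the upper left...
example = [[3, 0, 0],
           [0, 0, 0],
           [0, 0, 0]]
# ...ends up in the lower right.
print(apply_rule(example))
```

The hard part of ARC-AGI-1 and 2 was never applying the rule; it was inferring `apply_rule` from two or three examples.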
ARC-AGI-3: a “figure out what to do on your own” test
With ARC-AGI-3 the format changes fundamentally. Instead of looking at a problem and producing an answer, you are dropped into a game.
```mermaid
graph TD
    A[A 64×64 grid game screen appears<br/>no rules, no tutorial] --> B[Try an action on the screen<br/>paint a cell, move something, etc.]
    B --> C[The screen changes in response]
    C --> D{Infer the game's objective<br/>from the pattern of changes}
    D -->|still unclear| B
    D -->|figured it out| E[Act on the hypothesis]
    E --> F{Cleared?}
    F -->|failure| B
    F -->|success| G[Level cleared]
```
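That loop maps directly onto code. Below is a minimal sketch with a mock environment; the class, its `step` method, and the hidden win condition are all hypothetical stand-ins, not the actual ARC-AGI-3 SDK:

```python
import random

class MockEnvironment:
    """Hypothetical stand-in for an ARC-AGI-3 style turn-based environment;
    the real SDK API will differ. The win condition is hidden from the agent:
    here, secretly, action 3 must be chosen twice."""
    ACTIONS = tuple(range(6))

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self._hits = 0

    def step(self, action: int):
        """Apply one action; return (observation, done). No reward signal,
        no instructions. A tiny 4x4 frame stands in for the real 64x64 grid."""
        if action == 3:
            self._hits += 1
        frame = [[self._rng.randrange(16) for _ in range(4)] for _ in range(4)]
        return frame, self._hits >= 2

# The crudest possible agent: act at random and watch for "done".
env = MockEnvironment()
rng = random.Random(42)
done, steps = False, 0
while not done and steps < 10_000:
    _frame, done = env.step(rng.choice(MockEnvironment.ACTIONS))
    steps += 1
print("cleared" if done else "gave up", "after", steps, "actions")
```

A real agent would replace the random policy with one that builds a model of the environment from the observed frames, which is exactly the ability the benchmark is probing.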
You start playing with no instructions and have to realize on your own that the rules are probably such-and-such. It is much like learning a board game for the first time by watching it being played, without ever reading the rulebook.
A human typically needs only a few minutes to think, “ah, this is probably what’s going on.” In the experiments, the 486 participants completed every task, with a typical time of about 8 minutes.
By contrast, the strongest AIs as of March 2026 (GPT 5.4, Claude Opus 4.6, Gemini 3.1 Pro) all score below 1%. They clear almost nothing.
Why is this difficult for AI?
Current large language models (LLMs) work by applying patterns extracted from vast amounts of text. They can answer programming questions because they have seen a great deal of similar code, and they can solve math problems because they know similar solutions.
The ARC-AGI-3 environments are randomly generated on every run, so reusing previously seen patterns does not work. And because the tasks use no letters or numbers, only a 64×64 color grid, the models’ greatest weapon, linguistic ability, is blocked as well.
The test purely measures the ability to adapt on the fly to a new situation. The amount of stored knowledge is irrelevant. This is something humans do daily, and it is exactly where current AI is weakest.