Back-to-back releases of OpenAI GPT-5.3/5.4 and Saguaro-driven inference speedups
In just the first week of March 2026, OpenAI shipped two new models in the GPT-5 family back to back: GPT-5.3 Instant on March 2 and GPT-5.4 on March 5. Around the same time, an inference-speedup algorithm called Saguaro was posted to arXiv. Here’s a consolidated summary of all three.
GPT-5.3 Instant
OpenAI released GPT-5.3 Instant on March 2 as the fourth model in the GPT-5 series. A System Card (the model’s safety evaluation report) was published alongside it. The design principles were: faster responses; better understanding of context when using web search; and fewer unnecessary preambles or overly assertive phrasing.
Hallucination reduction
The System Card reports concrete numbers for reductions in hallucination (the model generating content that conflicts with fact).
| Evaluation setting | Reduction |
|---|---|
| Using web search for high‑risk questions | 26.8% |
| Answering from the model’s own knowledge only | 19.7% |
| Conversations users reported as “factually wrong” (with web search) | 22.5% |
| Conversations users reported as “factually wrong” (knowledge only) | 9.6% |
The improvements are largest when web search is involved. Note that these measurements come from OpenAI’s internal evaluation set; correspondence with external benchmarks is not disclosed.
Changes in conversational tone
GPT-5.3 Instant intentionally suppresses “unnecessary refusals” and “ethical preambles before answering.” Previous GPT-5 models had a habit of inserting cautionary notes even for non‑edge‑case questions, but GPT-5.3 Instant does this far less.
The way web search and prior knowledge are integrated also changed. Rather than “listing search results as is,” the model “uses its existing understanding to contextualize the latest information.” The copy‑paste feel is reduced, making responses read more naturally.
Safety evaluation regressions
It’s not all improvements. The System Card also candidly notes regressions (areas that worsened versus prior models).
| Metric | vs GPT‑5.1 Instant | vs GPT‑5.2 |
|---|---|---|
| Overall restricted content | Improved | Slightly worse |
| Sexual content | Worse | Worse |
| Self‑harm | – | Worse |
OpenAI says it will deploy system‑level mitigations on ChatGPT (external filtering, etc.) to compensate.
On the HealthBench medical benchmark, GPT-5.3 Instant scores 54.1% versus GPT-5.2’s 55.4%, a small step back. Strengths include “asking for missing information before answering” (+4.4%) and “avoiding definitive claims when uncertain” (+4.0%). Weaknesses include “deciding when a referral is needed” (−10.1%) and “tailoring answers to local medical conditions” (−5.5%).
Publishing regressions rather than hiding them is commendable. However, methodological details are deferred to the complete version on the OpenAI Deployment Safety Hub, which makes independent reproduction difficult.
API pricing
GPT-5.3 Instant is served as a model inside ChatGPT and is also available via the API as gpt-5.3-chat-latest.
| Item | Price (per 1M tokens) |
|---|---|
| Input | $1.75 |
| Cached input | $0.175 |
| Output | $14.00 |
This is a price increase over the GPT‑5.1 family and matches GPT‑5.2's pricing.
Reference: GPT-5.3 Instant System Card
GPT-5.4
Released on March 5, GPT-5.4 is positioned by OpenAI as “the most capable and efficient frontier model for professional work.” It ships in three variants: Standard, Thinking, and Pro.
Differences among the three variants
| Variant | Use case | Context |
|---|---|---|
| GPT‑5.4 (Standard) | General API use. Can ingest very large codebases or legal documents as is | Up to 1 million tokens |
| GPT‑5.4 Thinking | Reasoning‑oriented. Expands its chain of thought for multi‑step tasks | 1 million tokens |
| GPT‑5.4 Pro | Highest performance. Targeted at specialized tasks such as law and finance | 1 million tokens |
The 1M‑token context window is the largest yet among OpenAI models. Requests that exceed 272,000 tokens are billed at higher long‑context rates (input doubles; output rises by 50%, per the pricing table below).
GPT‑5.4 Pro tops Mercor’s APEX‑Agents benchmark, which evaluates professional skills in law and finance.
Computer use
The headline feature in GPT‑5.4 is native computer use. For the first time in a general‑purpose OpenAI model, it can look at screenshots and return mouse/keyboard operations—an approach similar to Anthropic’s Claude computer use.
The flow looks like this:
```mermaid
graph TD
    A[Harness captures a screenshot] --> B[Send the image to GPT-5.4]
    B --> C[GPT-5.4 analyzes the screen]
    C --> D{Choose operation mode}
    D -->|Code mode| E[Generate Python code using<br/>libraries such as Playwright]
    D -->|Screenshot mode| F[Generate coordinate-based<br/>click/type/scroll commands]
    E --> G[Harness executes the code]
    F --> G
    G --> H[Re-capture the resulting screen]
    H --> B
```
In code mode, the model generates Python that calls browser‑automation libraries like Playwright. In screenshot mode, it issues direct coordinate‑based click/type/scroll commands. In both modes, the harness loops by re‑capturing the screen and sending it back to the model.
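As a minimal sketch of the screenshot-mode loop, the harness can be modeled as a capture–decide–execute cycle. The model call is stubbed out here; the `Action` format and `fake_model` are illustrative assumptions, not OpenAI's actual API schema.

```python
# Toy harness loop for screenshot-mode computer use.
# A real harness would capture the actual screen, call the model API,
# and perform the returned mouse/keyboard action.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def fake_model(screenshot: bytes, step: int) -> Action:
    """Stand-in for GPT-5.4: maps a screenshot to one UI action."""
    script = [
        Action("click", x=120, y=48),
        Action("type", text="quarterly report"),
        Action("done"),
    ]
    return script[min(step, len(script) - 1)]

def run_harness(max_steps: int = 10) -> list[Action]:
    """Loop: capture screen -> ask model -> execute -> re-capture."""
    executed: list[Action] = []
    for step in range(max_steps):
        screenshot = b"<png bytes>"       # would come from the real screen
        action = fake_model(screenshot, step)
        if action.kind == "done":
            break
        executed.append(action)           # a real harness would perform it
    return executed
```

The loop terminates either when the model signals completion or when the step budget runs out, which is the usual safeguard against runaway agents.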
Benchmark results
GPT‑5.4 set new records across computer‑use benchmarks.
| Benchmark | What it measures | GPT‑5.4 | GPT‑5.2 | Human |
|---|---|---|---|---|
| OSWorld‑Verified | GUI operations on a desktop (OS control, app use, multi‑step workflows) | 75.0% | 47.3% | 72.4% |
| WebArena‑Verified | Real tasks in a web browser (uses both DOM and screenshots) | 67.3% | 65.4% | – |
| GDPval | Knowledge‑work tasks spanning 44 professions. Evaluates whether outputs meet or exceed domain‑expert quality | 83.0% | 70.9% | – |
Surpassing the human score (72.4%) with 75.0% on OSWorld‑Verified is especially notable: it is the first time an AI model has beaten the human baseline on this desktop‑operation benchmark, and a jump of nearly 28 points over GPT‑5.2's 47.3%.
GPT‑5.4 also suppresses hallucinations better than GPT‑5.2, with a 33% drop in claim‑level errors and an 18% drop in responses containing errors.
Tool Search
For API users, Tool Search is particularly practical.
With traditional tool calling (having the model invoke external tools via the API), you had to enumerate every tool definition in the system prompt for every request. Five or ten tools aren't a problem, but with dozens of tools the definitions alone consume thousands to tens of thousands of tokens, which is especially painful for agent systems.
Here’s how Tool Search works:
```mermaid
graph LR
    A[Pass only a lightweight<br/>tool list with the API request] --> B[Model analyzes the task]
    B --> C[Dynamically look up the definitions<br/>of the tools it needs]
    C --> D[Add only those tool definitions<br/>to the context]
    D --> E[Execute the tools]
```
The model first receives a lightweight tool list (name and short description only) and loads the full definitions only when it actually needs to use a tool. In OpenAI’s tests, environments with many tools saw a 47% reduction in token usage. Besides cost savings, keeping unnecessary definitions out of context improves the accuracy of tool selection.
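The split between the lightweight listing and on-demand definitions can be sketched as follows. The tool names and schemas here are made up for illustration; the actual Tool Search wire format is not specified in this summary.

```python
# Sketch of the Tool Search idea: every request carries only names and
# one-line descriptions; full JSON-schema definitions are loaded only
# when the model decides it needs a particular tool.
FULL_DEFINITIONS = {
    "get_weather": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    },
    "send_email": {
        "name": "send_email",
        "description": "Send an email to a recipient.",
        "parameters": {"type": "object",
                       "properties": {"to": {"type": "string"},
                                      "body": {"type": "string"}},
                       "required": ["to", "body"]},
    },
}

def lightweight_listing() -> list[dict]:
    """What goes into every request: name + short description only."""
    return [{"name": n, "description": d["description"]}
            for n, d in FULL_DEFINITIONS.items()]

def lookup_tool(name: str) -> dict:
    """Loaded on demand, only when the model selects this tool."""
    return FULL_DEFINITIONS[name]
```

With dozens of tools, the listing stays a few hundred tokens while the schemas, which dominate token cost, enter the context only when actually used.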
Safety evaluation for CoT monitoring
GPT‑5.4 Thinking introduces a new safety evaluation around controllability of chain‑of‑thought (CoT).
The underlying concern is this: reasoning models expand an internal chain of thought before producing answers. If a model could intentionally disguise its CoT, then even if the surface output looked safe, it might still be performing unsafe internal reasoning—undermining CoT‑based monitoring as a safety measure.
In contemporaneous research, “Reasoning models struggle to control their chains of thought,” OpenAI evaluated 13 reasoning models using a suite called CoT‑Control. Results show very low self‑control scores for CoT (0.1%–15.4%) across all models. Even when models recognize that they are failing to follow CoT instructions, they still cannot comply.
From a safety standpoint, that is good news: if models cannot arbitrarily control CoT, they also cannot easily disguise it to evade monitoring. CoT monitoring therefore remains a viable AI‑safety technique for now.
Reference: Reasoning models struggle to control their chains of thought
API pricing
| Model | Input | Cached input | Output |
|---|---|---|---|
| GPT‑5.4 (≤272K) | $2.50 | $0.25 | $15.00 |
| GPT‑5.4 (>272K) | $5.00 | $0.50 | $22.50 |
| GPT‑5.4 Pro (≤272K) | $30.00 | – | $180.00 |
| GPT‑5.4 Pro (>272K) | $60.00 | – | $270.00 |
(Per 1 million tokens.)
Standard GPT‑5.4 input is about 1.4× GPT‑5.3's ($2.50 vs $1.75 per 1M input tokens), and Pro is 12× higher still, so choose carefully. Using cached input cuts input cost by 90%, so cache strategy matters for agents that reuse prompts.
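The tiered pricing above can be turned into a small cost calculator. This assumes the 272K threshold applies to the total tokens in a request; how OpenAI measures the threshold is not spelled out in this summary.

```python
# Cost sketch for standard GPT-5.4 using the pricing table above.
# Assumption: the 272K threshold is checked against total request tokens.
PRICES = {  # (input, cached_input, output), USD per 1M tokens
    "short": (2.50, 0.25, 15.00),   # total request <= 272K tokens
    "long":  (5.00, 0.50, 22.50),   # total request  > 272K tokens
}

def request_cost(input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Return the USD cost of one request; cached tokens are the
    portion of the input billed at the 90%-discounted cache rate."""
    tier = "long" if input_tokens + output_tokens > 272_000 else "short"
    in_p, cache_p, out_p = PRICES[tier]
    fresh = input_tokens - cached_tokens
    return (fresh * in_p + cached_tokens * cache_p
            + output_tokens * out_p) / 1_000_000
```

For example, a 100K-token prompt with 80K cached tokens and a 10K-token response costs $0.22 instead of $0.40, which is why prompt reuse matters for agents.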
Reference: Introducing GPT-5.4
Saguaro: Removing Speculative Decoding’s serial dependency via speculation
A paper titled “Speculative Speculative Decoding” by Tanishq Kumar, Tri Dao, and Avner May—and its implementation, the Saguaro algorithm—was posted to arXiv in March. It accelerates LLM inference by addressing a structural limitation of existing Speculative Decoding head‑on.
What is Speculative Decoding?
A quick refresher: when LLMs generate text, they normally produce one token at a time—autoregressive decoding. Each token requires a full pass of the large model, which is slow.
Speculative Decoding mitigates this by having a small, fast “draft model” predict multiple tokens ahead, then validating them in a large “verifier model” in one shot. Because validation can be parallelized, if the draft’s predictions are correct, throughput improves substantially.
```mermaid
graph TD
    A[Draft model<br/>small, fast] -->|Generates multiple<br/>tokens at once| B[Candidate token sequence]
    B --> C[Verifier model<br/>large, accurate]
    C -->|Verifies in parallel| D{Accept or reject<br/>each token}
    D -->|Commit accepted tokens| E[Append to output]
    D -->|On rejection,<br/>restart from that point| A
```
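As a toy illustration (not a real implementation), the draft/verify loop can be simulated with two deterministic character-level "models": the verifier accepts the longest draft prefix that matches its own greedy choice, then appends one bonus token of its own.

```python
# Toy speculative decoding over characters. The draft's belief is
# imperfect ('u' where the verifier would produce 'o'), so one block
# gets partially rejected and corrected.
TARGET = "hello world"          # what the large verifier would generate
DRAFT_GUESS = "hello wurld"     # the small draft model's imperfect belief

def draft_block(prefix: str, k: int) -> str:
    """Draft proposes the next k tokens from its own belief."""
    i = len(prefix)
    return DRAFT_GUESS[i:i + k]

def verify(prefix: str, block: str) -> str:
    """Accept the longest prefix matching the verifier's greedy choice,
    then append one 'bonus' token from the verifier itself."""
    accepted = ""
    for tok in block:
        i = len(prefix) + len(accepted)
        if i < len(TARGET) and TARGET[i] == tok:
            accepted += tok
        else:
            break
    i = len(prefix) + len(accepted)
    if i < len(TARGET):
        accepted += TARGET[i]   # verifier's own next token
    return accepted

def generate(prompt: str, k: int = 4) -> str:
    text = prompt
    while len(text) < len(TARGET):
        text += verify(text, draft_block(text, k))
    return text
```

When all k draft tokens are accepted, one verifier pass commits k+1 tokens instead of 1, which is where the speedup comes from.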
Structural limitation
Speculative Decoding has a bottleneck: draft generation and verification depend on each other serially.
- The verifier must wait until the draft model finishes producing a token sequence.
- While the verifier is checking, the draft model sits idle.
One side is always idle, and this alternation sets a ceiling on speed.
The SSD idea
The core of SSD (Speculative Speculative Decoding) is to have the draft model also “predict the verification outcome.”
```mermaid
graph TD
    A[Draft model generates<br/>a token sequence] --> B[Simultaneously predicts the verification outcome:<br/>how far will it be accepted?]
    B --> C[Prepares the next token sequence<br/>in advance based on the prediction]
    A --> D[Verifier model runs<br/>verification in parallel]
    D --> E{Does the prediction match<br/>the actual verification result?}
    E -->|Match| F[Immediately return the<br/>pre-prepared token sequence]
    E -->|Mismatch| G[Fall back to standard<br/>Speculative Decoding]
    F --> H[Draft-generation wait time<br/>effectively drops to zero]
```
The draft model predicts “how far the verifier is likely to accept” and pre‑generates the next token block based on that. If verification agrees, the pre‑prepared block can be returned immediately—eliminating the draft‑generation overhead.
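Continuing the toy character-level setup (purely illustrative, not Saguaro's actual algorithm), SSD can be sketched by having the draft optimistically predict full acceptance and pre-build the following block while verification runs; counters stand in for draft-side latency.

```python
# Toy SSD: when the draft's acceptance prediction holds, the next
# block is already prepared (a cache hit, no draft-side wait); on a
# mispredict we fall back, take the verifier's correction, and pay
# for a fresh drafting round.
TARGET = "speculative speculative decoding"       # verifier's output
DRAFT_GUESS = "speculative speculativo decoding"  # draft's imperfect belief

def draft_block(pos: int, k: int) -> str:
    return DRAFT_GUESS[pos:pos + k]

def accepted_len(prefix: str, block: str) -> int:
    """How many draft tokens the verifier actually accepts."""
    n = 0
    for i, tok in enumerate(block):
        if len(prefix) + i < len(TARGET) and TARGET[len(prefix) + i] == tok:
            n += 1
        else:
            break
    return n

def ssd_generate(k: int = 5):
    text, draft_rounds, cache_hits = "", 1, 0
    prepared = draft_block(0, k)                   # drafted up front
    while len(text) < len(TARGET):
        # Draft optimistically predicts full acceptance and pre-builds
        # the following block while verification runs in parallel.
        next_prepared = draft_block(len(text) + len(prepared), k)
        n = accepted_len(text, prepared)
        if n == len(prepared) and n > 0:           # prediction held
            text += prepared
            prepared = next_prepared               # ready instantly
            cache_hits += 1
        else:                                      # mispredict: fall back
            text += prepared[:n] + TARGET[len(text) + n]
            prepared = draft_block(len(text), k)   # fresh draft, pay wait
            draft_rounds += 1
    return text, draft_rounds, cache_hits
```

In this run only two rounds incur draft-side waiting (the initial one and one fallback); every other block is returned from the pre-prepared cache.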
Three challenges and Saguaro’s solutions
The paper identifies three challenges to implementing SSD and proposes solutions. Combined, these make up the Saguaro algorithm.
| Challenge | Description | Saguaro’s solution |
|---|---|---|
| Predicting verification outcomes accurately | You need to forecast how far the verifier will accept | Predict “bonus tokens” (tokens added after verification) using the draft model’s top‑probability tokens; accuracy up to 90% |
| Trade‑off between cache hit rate and acceptance rate | Pushing verification‑prediction accuracy can degrade the draft’s own quality | A sampling algorithm that balances the two |
| Fallback when predictions fail | How to recover when predictions are wrong; the best strategy depends on batch size | Switch fallback strategies by batch size; maintains 20% speedup over Speculative Decoding even at large batch sizes |
Performance
Experiments on open‑source inference engines show:
| Baseline | Speedup |
|---|---|
| Optimized Speculative Decoding | 2× |
| Standard autoregressive decoding | 5× |
They also report that the throughput/latency Pareto frontier is pushed outward across all batch sizes compared to prior methods.
Tri Dao, a co‑author, is known for developing FlashAttention (an algorithm that accelerates the Transformer attention mechanism). If SSD becomes production‑ready, expect direct benefits for API pricing, LLM deployment on edge devices (e.g., smartphones and IoT with limited compute), and latency in real‑time applications.
Paper: Speculative Speculative Decoding (arxiv.org/abs/2603.03251)