
GPT-Image-2 Leaked on LM Arena Under Tape-Themed Codenames in Two Waves

By Ikesan

On April 4, 2026, three unfamiliar models suddenly appeared on LM Arena (formerly Chatbot Arena, now arena.ai), the benchmark site for image generation models.

  • maskingtape-alpha
  • gaffertape-alpha
  • packingtape-alpha

All three codenames riff on types of tape.
The models were pulled from the Arena within hours, but the blind-test results testers left behind during that brief window were striking:
they clearly outperformed the reigning top model at the time, Google DeepMind's Nano Banana Pro.

The models are believed to be GPT-Image-2, OpenAI’s next-generation image generation model still in development.

And in mid-April, the tape models resurfaced under new codenames.

OpenAI Has Done Anonymous Arena Tests Before

This isn’t the first time OpenAI has anonymously tested unreleased models on LM Arena.

```mermaid
graph TD
    A["Submit to LM Arena<br/>under anonymous codename"] --> B["Community runs<br/>blind tests"]
    B --> C["Pulled from Arena<br/>within hours to days"]
    C --> D["Official release<br/>weeks later"]
```

In December 2025, models codenamed Chestnut and Hazelnut appeared on the Arena and were officially released as GPT Image 1.5 a few weeks later.

The tape-themed codenames follow the same pattern. Developer Pieter Levels (@levelsio) and VC Justine Moore (@venturetwins) reported specific test cases, and the story spread from there.

Test Results: What Exactly Was So Strong?

No official Elo scores were published for the tape models before removal, but community blind tests reported the following results.

Text Rendering

Text accuracy was said to be 90-95% with GPT Image 1.5, but the tape models reached near-perfect levels.
Reports include accurately rendering handwritten-style medical notes and speech bubble text inside manga panels.

World Knowledge

The models reproduced IKEA store exteriors with architectural accuracy and generated YouTube and Windows UIs indistinguishable from actual screenshots.

On a prompt asking for a first-person Minecraft screenshot with correct in-game UI, maskingtape-alpha dominated every competing model.

Photorealism

Textures and lighting approached real photography, with portraits evaluated as “indistinguishable from real photos.”
Anatomical accuracy of hands and reflections in sunglasses also improved.

One tester put it this way: "Nano Banana Pro looks like DALL-E in comparison."
Sweeping all three categories (realism, text, and world knowledge) at once is rare.

Technical Changes in GPT-Image-2 (Based on Leaked Information)

While GPT Image 1.5 was a model integrated into GPT-4o, GPT-Image-2 reportedly adopts a standalone architecture.

| Aspect | GPT Image 1.5 | GPT-Image-2 (Leaked) |
|---|---|---|
| Architecture | GPT-4o integrated (2-stage inference) | Standalone (single-pass inference) |
| Approach (estimated) | Autoregressive + diffusion hybrid | Autoregressive + diffusion hybrid |
| Max aspect ratio | 3:2 | 16:9 |
| Text accuracy | 90-95% | Near-perfect |
| Color cast | Warm yellowish cast | Neutral |
| Generation speed | 8-12 seconds | Under 3 seconds (predicted) |
| Max resolution | 1536x1024 | 2048x2048 (predicted) |

Support for 16:9 widescreen output is a major practical improvement. Testers used it as a fingerprint: whether a prompt containing "Format 16:9" actually produced 16:9 output became a way to identify suspected GPT-Image-2 models.

Single-Pass Inference

GPT Image 1.5 used a two-stage process: first interpreting input through text understanding (autoregressive phase), then generating pixels through image generation (diffusion phase).

GPT-Image-2 unifies this into a single pass, with text understanding and image generation proceeding simultaneously. The massive speed improvement (8-12 seconds to under 3 seconds) is largely attributed to this architectural change.
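The claimed speedup follows directly from removing the sequential handoff. A toy latency model makes the arithmetic concrete (function names and phase timings are illustrative, chosen only to match the reported 8-12 second and under-3-second figures; this is not OpenAI's actual pipeline):

```python
def two_stage_latency(text_phase_s: float, diffusion_phase_s: float) -> float:
    """GPT Image 1.5 style (as reported): phases run back to back, so latencies add."""
    return text_phase_s + diffusion_phase_s

def single_pass_latency(unified_pass_s: float) -> float:
    """GPT-Image-2 style (as reported): one unified pass, no inter-phase handoff."""
    return unified_pass_s

# A ~4s understanding phase plus a ~6s diffusion phase lands inside the
# reported 8-12s range; a single ~3s unified pass matches the prediction.
old = two_stage_latency(4.0, 6.0)
new = single_pass_latency(3.0)
```

Under this reading, most of the win comes from eliminating the second model invocation entirely rather than from making either phase faster.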

The Yellow Cast Problem in GPT Image 1.5

Images generated by GPT Image 1.5 tended to have a warm yellowish color cast, particularly noticeable in prompts requesting white backgrounds or neutral tones. The tape models resolved this yellow cast, significantly improving color reproduction.

Current LM Arena Leaderboard

As of April 9, 2026, the Elo ranking based on over 4.5 million blind test votes (excerpt).

| Rank | Model | Elo |
|---|---|---|
| 1st | gemini-3.1-flash-image-preview (Google) | 1264 ± 6 |
| 2nd | gpt-image-1.5-high-fidelity (OpenAI) | 1241 ± 4 |
| 3rd | gemini-3-pro-image-preview-2k (Google) | 1237 ± 4 |
| 24th | gpt-image-1 | 1115 ± 3 |
| 51st | dall-e-3 | 968 ± 4 |

The tape models don’t appear on this leaderboard.
LM Arena has a “Battle Mode” feature for blind testing — two anonymous models generate images from the same prompt, and users vote on which is better.
Model names are only revealed after voting, and the tape models existed solely in this Battle Mode rotation.
That’s why searching the leaderboard won’t turn them up.
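For intuition, here is how pairwise blind-test votes turn into ratings under the standard Elo formula (a minimal sketch with an illustrative K-factor of 32; LM Arena's actual rating methodology may differ):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated ratings for (A, B) after one blind-test vote."""
    e_a = expected_score(r_a, r_b)
    delta = k * ((1.0 if a_won else 0.0) - e_a)
    return r_a + delta, r_b - delta

# A 1241-rated model beating a 1264-rated one gains a bit more than k/2,
# since the win was mildly unexpected; the loser drops by the same amount.
new_a, new_b = elo_update(1241, 1264, a_won=True)
```

Because each vote moves both models symmetrically, millions of votes are needed before the ± confidence intervals in the table above get tight.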

Google’s Gemini 3.1 Flash Image Preview currently holds the top spot, with GPT Image 1.5 at second. An official GPT-Image-2 release could significantly shake up these rankings.

The End of the DALL-E Brand

OpenAI plans to retire DALL-E 2 and DALL-E 3 on May 12, 2026.
Going forward, everything consolidates under the “GPT Image” series, ending DALL-E’s run as a brand name.

In a separate context, OpenAI also shut down Sora, its video generation model, on March 24, 2026.
Inference costs ran $15 million per day against lifetime revenue of $2.1 million — a staggering deficit.
OpenAI’s multimedia strategy is clearly converging on GPT Image for image generation.

Second Wave: Duct-Tape Codenames Resurface

Around April 14-15, three new tape-themed codenames appeared in LM Arena’s Battle Mode.

  • duct-tape-1
  • duct-tape-2
  • duct-tape-3

Unlike the first wave of maskingtape/gaffertape/packingtape that was pulled within hours, the duct-tape models weren’t immediately removed and remained in the Battle Mode rotation.
They didn’t appear on the official leaderboard — they only showed up as opponents in Battle Mode, where model names are revealed only after voting.
However, as of April 16, multiple attempts in Battle Mode failed to surface any duct-tape models, suggesting they may have been removed as well.

How Battle Mode Works

To encounter the duct-tape models, you needed to use Battle Mode. Here’s how it worked:

  1. Go to arena.ai
  2. Make sure “Battle Mode” is selected in the top left
  3. Enter a prompt to generate images
  4. Two anonymous models generate images side by side — vote for the better one
  5. Model names are revealed after voting
  6. If you see duct-tape-1 / duct-tape-2 / duct-tape-3, those are the models believed to be GPT-Image-2

It was pure luck — you couldn’t choose your opponent model.
Like the first wave, the duct-tape models appear to have been pulled after a short period.

Differences Between Variants

Community test reports rated duct-tape-2 and duct-tape-3 highly, with duct-tape-1 considered relatively lightweight.
duct-tape-3 was praised for the strongest detail work, capable of generating intricate backgrounds while preserving character art styles from reference images.

Japanese Text Rendering Accuracy

The duct-tape models showed significant improvement in Japanese text rendering.
In tests generating Japanese train advertisements from prompts, the accuracy of Japanese text layout and character reproduction was remarkably high.

Rubik’s Cube Mirror Reflection Still Unsolved

In a Rubik’s Cube mirror reflection test — whether the color arrangement of a cube reflected in a mirror is physically correct — GPT-Image-2 still fails. The limits of spatial reasoning persist even across generations.

Release Timeline and Pricing Predictions

Analyst consensus expects an official release between late April and mid-May 2026.
Given the alignment with the DALL-E retirement date (May 12), that window looks likely.

API pricing is predicted at $0.15-$0.20 per image.
GPT Image 1.5’s high-fidelity mode (1024x1024) costs $0.133-$0.200, so no major price shift is expected.
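A quick back-of-envelope check of what the predicted pricing means at volume (the GPT-Image-2 figures are community predictions from above, not official prices):

```python
# USD per image: GPT Image 1.5 high-fidelity (1024x1024) vs predicted GPT-Image-2
gpt_image_15_range = (0.133, 0.200)
gpt_image_2_range = (0.15, 0.20)   # predicted, unofficial

images_per_month = 10_000
low = images_per_month * gpt_image_2_range[0]
high = images_per_month * gpt_image_2_range[1]
# 10,000 images/month would run roughly 1,500-2,000 USD either way,
# which is why no major price shift is expected for existing workloads.
```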


The first wave was pulled within hours, but the duct-tape variants stuck around in Battle Mode a bit longer.
OpenAI is likely running intentional blind tests at this point.
Chestnut/Hazelnut went from Arena appearance to official release in a few weeks, so GPT-Image-2 likely lands around the May 12 DALL-E retirement date.