
GPT-Image-2 Leaked on LM Arena Under Tape-Themed Codenames in Two Waves

By Ikesan

On April 4, 2026, three unfamiliar models suddenly appeared on LM Arena (formerly Chatbot Arena, now arena.ai), the benchmark site for image generation models.

  • maskingtape-alpha
  • gaffertape-alpha
  • packingtape-alpha

All three codenames riff on types of tape.
The models were pulled from the Arena within hours, but the blind-test results testers left behind during that brief window were striking:
they clearly outperformed the reigning top model at the time, Google DeepMind's Nano Banana Pro.

The models are believed to be GPT-Image-2, OpenAI’s next-generation image generation model still in development.

And in mid-April, the tape models resurfaced under new codenames.

OpenAI Has Done Anonymous Arena Tests Before

This isn’t the first time OpenAI has anonymously tested unreleased models on LM Arena.

```mermaid
graph TD
    A["Submit to LM Arena<br/>under anonymous codename"] --> B["Community runs<br/>blind tests"]
    B --> C["Pulled from Arena<br/>within hours to days"]
    C --> D["Official release<br/>weeks later"]
```

In December 2025, models codenamed Chestnut and Hazelnut appeared on the Arena and were officially released as GPT Image 1.5 a few weeks later.

The tape-themed codenames follow the same pattern. Developer Pieter Levels (@levelsio) and VC Justine Moore (@venturetwins) reported specific test cases, and the story spread from there.

Test Results: What Exactly Was So Strong?

No official Elo scores were published for the tape models before removal, but community blind tests reported the following results.

Text Rendering

Text accuracy was said to be 90-95% with GPT Image 1.5, but the tape models reached near-perfect levels.
Reports include accurately rendering handwritten-style medical notes and speech bubble text inside manga panels.

World Knowledge

The models reproduced IKEA store exteriors with architectural accuracy and generated YouTube and Windows UIs indistinguishable from actual screenshots.

On a prompt asking for a first-person Minecraft screenshot with correct in-game UI, maskingtape-alpha dominated every competing model.

Photorealism

Textures and lighting approached real photography, with portraits evaluated as “indistinguishable from real photos.”
Anatomical accuracy of hands and reflections in sunglasses also improved.

One tester put it this way: "Nano Banana Pro looks like DALL-E in comparison."
Sweeping all three categories (realism, text, and world knowledge) at once is rare.

Technical Changes in GPT-Image-2 (Based on Leaked Information)

While GPT Image 1.5 was a model integrated into GPT-4o, GPT-Image-2 reportedly adopts a standalone architecture.

| Aspect | GPT Image 1.5 | GPT-Image-2 (Leaked) |
|---|---|---|
| Architecture | GPT-4o integrated (2-stage inference) | Standalone (single-pass inference) |
| Approach (estimated) | Autoregressive + diffusion hybrid | Autoregressive + diffusion hybrid |
| Max aspect ratio | 3:2 | 16:9 |
| Text accuracy | 90-95% | Near-perfect |
| Color cast | Warm yellowish cast | Neutral |
| Generation speed | 8-12 seconds | Under 3 seconds (predicted) |
| Max resolution | 1536x1024 | 2048x2048 (predicted) |

Support for 16:9 widescreen output is a major practical improvement. Testers used it as a fingerprint: whether a prompt containing "Format 16:9" actually produced 16:9 output became a way to identify suspected GPT-Image-2 models.

Single-Pass Inference

GPT Image 1.5 used a two-stage process: first interpreting input through text understanding (autoregressive phase), then generating pixels through image generation (diffusion phase).

GPT-Image-2 unifies this into a single pass, with text understanding and image generation proceeding simultaneously. The massive speed improvement (8-12 seconds to under 3 seconds) is largely attributed to this architectural change.
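The claimed speedup follows directly from removing the sequential handoff. A toy latency model makes the arithmetic concrete (function names and phase timings are illustrative, chosen only to match the reported 8-12 second and under-3-second figures; this is not OpenAI's actual pipeline):

```python
def two_stage_latency(text_phase_s: float, diffusion_phase_s: float) -> float:
    """GPT Image 1.5 style (as reported): phases run back to back, so latencies add."""
    return text_phase_s + diffusion_phase_s

def single_pass_latency(unified_pass_s: float) -> float:
    """GPT-Image-2 style (as reported): one unified pass, no inter-phase handoff."""
    return unified_pass_s

# A ~4s understanding phase plus a ~6s diffusion phase lands inside the
# reported 8-12s range; a single ~3s unified pass matches the prediction.
old = two_stage_latency(4.0, 6.0)
new = single_pass_latency(3.0)
```

Under this reading, most of the win comes from eliminating the second model invocation entirely rather than from making either phase faster.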

The Yellow Cast Problem in GPT Image 1.5

Images generated by GPT Image 1.5 tended to have a warm yellowish color cast, particularly noticeable in prompts requesting white backgrounds or neutral tones. The tape models resolved this yellow cast, significantly improving color reproduction.

Current LM Arena Leaderboard

As of April 9, 2026, the Elo ranking based on over 4.5 million blind test votes (excerpt).

| Rank | Model | Elo |
|---|---|---|
| 1st | gemini-3.1-flash-image-preview (Google) | 1264 ± 6 |
| 2nd | gpt-image-1.5-high-fidelity (OpenAI) | 1241 ± 4 |
| 3rd | gemini-3-pro-image-preview-2k (Google) | 1237 ± 4 |
| 24th | gpt-image-1 | 1115 ± 3 |
| 51st | dall-e-3 | 968 ± 4 |

The tape models don’t appear on this leaderboard.
LM Arena has a “Battle Mode” feature for blind testing — two anonymous models generate images from the same prompt, and users vote on which is better.
Model names are only revealed after voting, and the tape models existed solely in this Battle Mode rotation.
That’s why searching the leaderboard won’t turn them up.
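For intuition, here is how pairwise blind-test votes turn into ratings under the standard Elo formula (a minimal sketch with an illustrative K-factor of 32; LM Arena's actual rating methodology may differ):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated ratings for (A, B) after one blind-test vote."""
    e_a = expected_score(r_a, r_b)
    delta = k * ((1.0 if a_won else 0.0) - e_a)
    return r_a + delta, r_b - delta

# A 1241-rated model beating a 1264-rated one gains a bit more than k/2,
# since the win was mildly unexpected; the loser drops by the same amount.
new_a, new_b = elo_update(1241, 1264, a_won=True)
```

Because each vote moves both models symmetrically, millions of votes are needed before the ± confidence intervals in the table above get tight.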

Google’s Gemini 3.1 Flash Image Preview currently holds the top spot, with GPT Image 1.5 at second. An official GPT-Image-2 release could significantly shake up these rankings.

The End of the DALL-E Brand

OpenAI plans to retire DALL-E 2 and DALL-E 3 on May 12, 2026.
Going forward, everything consolidates under the “GPT Image” series, ending DALL-E’s run as a brand name.

In a separate context, OpenAI also shut down Sora, its video generation model, on March 24, 2026.
Inference costs ran $15 million per day against lifetime revenue of $2.1 million — a staggering deficit.
OpenAI’s multimedia strategy is clearly converging on GPT Image for image generation.

Second Wave: Duct-Tape Codenames Resurface

Around April 14-15, three new tape-themed codenames appeared in LM Arena’s Battle Mode.

  • duct-tape-1
  • duct-tape-2
  • duct-tape-3

Unlike the first wave of maskingtape/gaffertape/packingtape that was pulled within hours, the duct-tape models weren’t immediately removed and remained in the Battle Mode rotation.
They didn’t appear on the official leaderboard — they only showed up as opponents in Battle Mode, where model names are revealed only after voting.
However, as of April 16, multiple attempts in Battle Mode failed to surface any duct-tape models, suggesting they may have been removed as well.

How Battle Mode Works

To encounter the duct-tape models, you needed to use Battle Mode. Here’s how it worked:

  1. Go to arena.ai
  2. Make sure “Battle Mode” is selected in the top left
  3. Enter a prompt to generate images
  4. Two anonymous models generate images side by side — vote for the better one
  5. Model names are revealed after voting
  6. If you see duct-tape-1 / duct-tape-2 / duct-tape-3, those are the models believed to be GPT-Image-2

It was pure luck — you couldn’t choose your opponent model.
Like the first wave, the duct-tape models appear to have been pulled after a short period.

Differences Between Variants

Community test reports rated duct-tape-2 and duct-tape-3 highly, with duct-tape-1 considered relatively lightweight.
duct-tape-3 was praised for the strongest detail work, capable of generating intricate backgrounds while preserving character art styles from reference images.

Japanese Text Rendering Accuracy

The duct-tape models showed significant improvement in Japanese text rendering.
In tests generating Japanese train advertisements from prompts, the accuracy of Japanese text layout and character reproduction was remarkably high.

Rubik’s Cube Mirror Reflection Still Unsolved

In a Rubik’s Cube mirror reflection test — whether the color arrangement of a cube reflected in a mirror is physically correct — GPT-Image-2 still fails. The limits of spatial reasoning persist even across generations.

Release Timeline and Pricing Predictions

Analyst consensus expects an official release between late April and mid-May 2026.
Given the alignment with the DALL-E retirement date (May 12), that window looks likely.

API pricing is predicted at $0.15-$0.20 per image.
GPT Image 1.5’s high-fidelity mode (1024x1024) costs $0.133-$0.200, so no major price shift is expected.
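A quick back-of-envelope check of what the predicted pricing means at volume (the GPT-Image-2 figures are community predictions from above, not official prices):

```python
# USD per image: GPT Image 1.5 high-fidelity (1024x1024) vs predicted GPT-Image-2
gpt_image_15_range = (0.133, 0.200)
gpt_image_2_range = (0.15, 0.20)   # predicted, unofficial

images_per_month = 10_000
low = images_per_month * gpt_image_2_range[0]
high = images_per_month * gpt_image_2_range[1]
# 10,000 images/month would run roughly 1,500-2,000 USD either way,
# which is why no major price shift is expected for existing workloads.
```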


The first wave was pulled within hours, but the duct-tape variants stuck around in Battle Mode a bit longer.
OpenAI is likely running intentional blind tests at this point.
Chestnut/Hazelnut went from Arena appearance to official release in a few weeks, so GPT-Image-2 likely lands around the May 12 DALL-E retirement date.