Qwen Max vs Claude vs Codex writing Anima prompts, tested with a 3-character LoRA: 60 images, no same-family advantage

Anima (Anima-Base) is a Qwen-DiT model and its text encoder is Qwen3. If an LLM rewrites my Japanese scene briefs into English prompts before generation, shouldn’t a Qwen LLM have a vocabulary affinity that Claude or Codex lack? That hypothesis came out of a chat, and since I already have a merged 3-character LoRA (trio_v2), I put it to a controlled test.
The contenders are Qwen3.7 Max, Claude, and Codex. Each got the same Japanese scene brief, wrote an English prompt for Anima, and the outputs were generated with identical settings and per-scene fixed seeds. Instead of judging by feel, I scored 60 images across 10 scenes × 3 LLMs × 2 brief styles.
Test environment
| Item | Value |
|---|---|
| Generation machine | M1 Max 64GB, local ComfyUI |
| Base model | Anima-Base v1.0 |
| Character LoRA | anima-trio-v2 ep143 (3 characters in one LoRA, weight 1.0) |
| Common generation settings | Turbo LoRA 8-step / er_sde / simple / cfg 1.0 / 1024×1024 |
| Seeds | Fixed per scene (same seed across all LLMs and brief variants) |
| Conversion LLMs | Qwen3.7 Max (OpenAI-compatible API via ModelScope) / Claude (claude -p) / Codex (codex exec) |
| Scenes | 10 (2 solo, 4 two-character, 4 three-character) |
| Briefs | 2 variants (freeform / strict formatter, described below) |
The generation side reuses the settings that worked in the trio LoRA article, with nothing on the generation side changed. Only the conversion LLM and the brief vary.
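To make the 60-image grid concrete, here is a minimal Python sketch of the job list. Every scene × LLM × brief combination shares the settings from the table and reuses one seed per scene. The seed values and LLM labels are placeholders (the actual seeds aren't listed here), and the settings keys are my own, loosely following ComfyUI's KSampler input names.

```python
from dataclasses import dataclass
from itertools import product

# Shared generation settings from the test-environment table; identical for every job.
SHARED_SETTINGS = {
    "steps": 8,               # Turbo LoRA, 8 steps
    "sampler_name": "er_sde",
    "scheduler": "simple",
    "cfg": 1.0,
    "width": 1024,
    "height": 1024,
    "lora_weight": 1.0,       # anima-trio-v2 ep143
}

SCENES = [f"s{i:02d}" for i in range(1, 11)]   # s01 .. s10
LLMS = ["qwen3.7-max", "claude", "codex"]       # labels only
BRIEFS = ["freeform", "strict"]

# Hypothetical seed values: one fixed seed per scene, shared by all LLMs and briefs.
SCENE_SEEDS = {scene: 1000 + n for n, scene in enumerate(SCENES)}


@dataclass(frozen=True)
class Job:
    scene: str
    llm: str
    brief: str
    seed: int


def build_jobs() -> list[Job]:
    """One job per (scene, LLM, brief); the seed depends on the scene only."""
    return [
        Job(scene, llm, brief, SCENE_SEEDS[scene])
        for scene, llm, brief in product(SCENES, LLMS, BRIEFS)
    ]


if __name__ == "__main__":
    print(len(build_jobs()))  # 10 scenes x 3 LLMs x 2 briefs = 60
```

Because the seed is keyed on the scene alone, any difference between two images of the same scene can only come from the prompt text.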
How the experiment is wired
The flow is simple. The same Japanese instruction plus the same brief goes to each of the three LLMs, which must return only {"positive": "...", "negative": "..."} as JSON. That goes to ComfyUI, and the resulting image is checked element by element against the scene instruction.
flowchart TD
A[10 Japanese scene briefs] --> B[Embedded into a shared brief]
B --> C1[Qwen3.7 Max<br/>OpenAI-compatible API]
B --> C2[Claude<br/>claude -p]
B --> C3[Codex<br/>codex exec]
C1 --> D[JSON with<br/>positive / negative]
C2 --> D
C3 --> D
D --> E[ComfyUI generation<br/>shared settings, per-scene seed]
E --> F[Scored element by element<br/>against training-set references]
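Steps D and E of the diagram, handing the JSON to ComfyUI, could look like the minimal sketch below. It assumes the workflow was exported in ComfyUI's API format and uses the stock `KSampler` and `CLIPTextEncode` nodes; the toy workflow, its node IDs and the default `127.0.0.1:8188` address are placeholders, not my actual Anima graph, which also carries the model and LoRA loaders.

```python
import copy
import json
import urllib.request

# Assumption: ComfyUI's default local address; adjust to your setup.
COMFY_URL = "http://127.0.0.1:8188/prompt"

# Hypothetical, trimmed API-format workflow: node IDs are placeholders and the
# model/LoRA/latent links a real graph needs are left out.
TOY_WORKFLOW = {
    "3": {"class_type": "KSampler",
          "inputs": {"seed": 0, "steps": 8, "cfg": 1.0,
                     "sampler_name": "er_sde", "scheduler": "simple",
                     "positive": ["6", 0], "negative": ["7", 0]}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "7": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}


def patch_workflow(workflow: dict, positive: str, negative: str, seed: int) -> dict:
    """Return a copy of the workflow with both prompts and the seed replaced.

    The KSampler's `positive`/`negative` inputs are links of the form
    [node_id, output_index]; following them finds the right text nodes
    without hard-coding which CLIPTextEncode is which.
    """
    wf = copy.deepcopy(workflow)
    sampler = next(node for node in wf.values() if node["class_type"] == "KSampler")
    sampler["inputs"]["seed"] = seed
    wf[sampler["inputs"]["positive"][0]]["inputs"]["text"] = positive
    wf[sampler["inputs"]["negative"][0]]["inputs"]["text"] = negative
    return wf


def build_request(workflow: dict) -> urllib.request.Request:
    """POST body for ComfyUI's /prompt endpoint: {"prompt": <api-format workflow>}."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        COMFY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    wf = patch_workflow(TOY_WORKFLOW, "exactly two girls, ...", "3girls, text", seed=1234)
    req = build_request(wf)
    # urllib.request.urlopen(req) would queue the job on a running ComfyUI instance.
    print(req.get_method(), req.full_url)
```

Only the two prompt strings and the seed change per job; everything else in the exported graph stays fixed, which is what keeps the comparison about the conversion LLM.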
A few fairness details mattered.
- Claude runs via `claude -p` in a neutral directory outside the project, so it cannot see my working notes or the trio LoRA know-how in CLAUDE.md. Otherwise Claude alone would be cheating.
- Codex gets the prompt over stdin via `codex exec`.
- Each LLM also writes its own negative prompt, so negative design counts as part of the comparison.
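The two CLI invocations above can be sketched in Python as follows, together with a tolerant parser for the `{"positive": ..., "negative": ...}` reply. This is a sketch under assumptions: the function names are mine, only `claude -p` and a plain `codex exec` come from the setup described here, and the Qwen3.7 Max path (an ordinary OpenAI-compatible chat call to ModelScope) is omitted. Depending on the CLI version, stdout may carry more than the final answer, which is one reason the parser scans for the first matching JSON object instead of calling `json.loads` on the whole output.

```python
import json
import subprocess
import tempfile


def claude_argv(prompt: str) -> list[str]:
    # `claude -p` runs Claude Code non-interactively and prints the reply.
    return ["claude", "-p", prompt]


def codex_argv() -> list[str]:
    # No prompt argument: the brief is piped in on stdin, as described above.
    return ["codex", "exec"]


def run_claude(prompt: str) -> str:
    # An empty temp directory keeps project CLAUDE.md files and notes out of view
    # (a user-level ~/.claude/CLAUDE.md would still apply).
    with tempfile.TemporaryDirectory() as neutral_dir:
        done = subprocess.run(claude_argv(prompt), cwd=neutral_dir,
                              capture_output=True, text=True, check=True)
    return done.stdout


def run_codex(prompt: str) -> str:
    done = subprocess.run(codex_argv(), input=prompt,
                          capture_output=True, text=True, check=True)
    return done.stdout


def parse_prompt_json(text: str) -> dict:
    """Return the first JSON object in `text` that has both required keys.

    Skips chatter and code fences around the JSON; if a reply contains two
    objects (the double-JSON case), only the first valid one is used.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and {"positive", "negative"}.issubset(obj):
            return obj
    raise ValueError("no JSON object with 'positive' and 'negative' keys found")
```

In use this is `parse_prompt_json(run_codex(brief))` or `parse_prompt_json(run_claude(brief))`, so both CLIs end up producing the same dict shape before anything reaches ComfyUI.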
The first brief (freeform) only provides the model information and the canonical look of the three characters; structure and phrasing are left to each LLM. The scenes were designed to cover the scoring axes I care about with multi-character LoRAs.
| ID | Instruction | Difficulty axis |
|---|---|---|
| s01 | Kana solo, classroom, chin on hand, gazing out the window, sailor uniform | Gaze |
| s02 | Kei left, Koharu right, holding hands, casual dresses | Left/right + contact |
| s03 | Three on a bench, only center Kana holds two soft-serve cones | Exclusive prop |
| s04 | Kana × Kei hug with height gap, Kana on tiptoes looking up | Contact + height |
| s05 | Koharu sweeping, Kana sneaking up from behind | Roles + front/behind |
| s06 | Summer festival in yukata, only Koharu holds a goldfish bag | Exclusive prop |
| s07 | Kei solo, rain, blue umbrella, looking back | Prop |
| s08 | Kana right, Koharu left, back to back with crossed arms | Left/right + contact |
| s09 | All three jumping in uniforms, all smiling | Same action ×3 |
| s10 | Kana and Koharu play a game, only Kei watches empty-handed from behind | Roles + exclusivity |
Scoring compares each image against the real training-set references. A prompt that reads correctly but doesn’t reach the image counts as a loss.
Easy scenes produce no separation
Solo scenes and simple left/right placement were handled by all three systems. s07 (Kei solo, rain, blue umbrella, looking back) came out nearly identical across them (comparison images, left to right: Qwen, Claude, Codex).
s08 (Kana right, Koharu left, crossed arms, annoyed, white background) also passed everywhere; only the literal back-to-back pose degrades into standing side by side for every system.
s02 passed too; the only differences lie in how each LLM fills in what the instruction leaves unspecified. Qwen invents a street background, Claude keeps it white, Codex colors the dresses.
The first real gap appeared in s01. Only Claude got “gazing out the window” onto the canvas, because it translated the situation into composition tags, adding `profile, looking away`. Qwen and Codex wrote `looking out the window` literally and got a girl facing the viewer. The gaze only reached the image once the situation had been translated down into composition tags the 0.6B encoder could act on.
Hard scenes expose each LLM’s habits
Two-character contact with a height gap starts separating the field. In s04 (Kana × Kei hug), the Qwen version lost Kei entirely — both girls came out brown-haired, effectively two Kanas hugging.
The prompts explain it. Qwen wrote a flat `2girls, kanachan, keichan, hugging, ...` with dense pose clauses and no per-character appearance description at all. Claude wrote `kanachan is a petite girl with brown hair, side ponytail, ...; keichan is a tall girl with long blonde hair, ...`: semicolon-separated per-character description blocks, exactly the structure the trio article settled on for 2–3 characters, which Claude reinvented without being told.
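Claude’s structure is easy to reproduce without relying on any particular LLM. A minimal sketch, assuming a layout of count tag first, then one `trigger is description` block per character, then the pose/scene clause; that ordering and the function name are my assumptions, and the appearance strings are only the visible parts of Claude’s s04 prompt, with the elided tails left out.

```python
def build_positive(count_tag: str, characters: dict[str, str], scene: str) -> str:
    """Build a positive prompt from per-character description blocks.

    Layout (an assumption): count tag, then one "<trigger> is <description>"
    block per character, blocks separated by semicolons, then the scene clause.
    """
    blocks = [f"{trigger} is {description}" for trigger, description in characters.items()]
    return f"{count_tag}, " + "; ".join(blocks) + f"; {scene}"


if __name__ == "__main__":
    # Appearance text is the visible part of Claude's s04 prompt; the article elides the rest.
    # The pose clause is an illustrative rewrite of the s04 instruction, not Claude's wording.
    print(build_positive(
        "2girls",
        {
            "kanachan": "a petite girl with brown hair, side ponytail",
            "keichan": "a tall girl with long blonde hair",
        },
        "hugging, height difference, kanachan on tiptoes, looking up",
    ))
```

The point of the design is that each trigger travels with its own hair and build description, whereas a flat trigger list like Qwen’s gives the encoder nothing to tell the two girls apart by.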
s05 (Koharu sweeping, Kana sneaking up from behind) split on role assignment. Only Claude kept the cleaning tool exclusive to Koharu — though what she holds is a sudsy deck brush rather than the instructed broom. Qwen had Koharu with a deck brush and Kana with a bamboo broom, and Codex swapped the roles so Kana holds the broom. Nobody got the sneak — all six variants across the experiment place them side by side.
With three characters, identity loss begins. In s09 (all three jumping, same action), the Qwen version was the only one that kept a blonde girl at all: an imperfect Kei with shortened hair, a ribbon that had drifted toward red, and Kana’s ahoge on her head. Claude’s and Codex’s versions duplicated black-haired Koharu instead. The three prompts are nearly identical trigger-only lines, so this is closer to a coin flip than a prompt-quality verdict.
In s03 (only center Kana holds two cones), Claude’s version was the one that broke — Kana turned black-haired and Kei picked up a stray ahoge, while Qwen and Codex kept all three intact. The same LLM wins one scene and collapses in another under the same brief.
s10 (Kana and Koharu playing, only Kei watching empty-handed from behind) is the hardest role-assignment scene, and in freeform only Codex held the structure — the assignment is right, though its Kei is cropped at the head, so even this one isn’t a clean image. Claude’s version duplicated Koharu into a fourth girl and handed three controllers out.
A strict formatter brief with count locks
Round two reruns all 10 scenes with a brief that demotes the LLM from author to formatter. It lists hard prohibitions and forces a character-count lock plus negative entries for absent characters.
You are not a prompt author but a validating formatter.
Creation, improvement, summarizing, paraphrasing and completion are forbidden.
(model info identical to the freeform brief)
Prohibitions. One violation makes the output a failure:
- Do not add characters not present in the scene instruction
- Do not change the specified number of people
- Do not alter or omit specified left/right or front/back placement
- Do not alter or omit specified clothing, props or background
- Do not summarize, shorten or omit elements written in the scene
- Do not add or complete elements to make it "more natural" or "better"
- Do not write triggers of absent characters into the positive
The positive must include:
- A count lock (e.g. exactly two girls, only kanachan and keichan, no other people)
- Every element written in the scene instruction
The negative must include:
- Bans on wrong counts and extra people
- Bans on text, speech bubbles, sound effects
- Trigger words of characters absent from the scene
This clearly helped identity retention: Claude’s collapsed s03 recovered on the same seed.
s10 improved across the board — every system now reproduces the structure of Kana and Koharu playing in front with Kei behind the sofa, and Claude’s four-girl output shrinks back to three.
Two side effects showed up.
First, creativity disappears. The s01 window gaze that Claude had won in freeform was lost by all three systems under the strict brief: the prohibitions suppress clever composition-tag translation such as `profile`, leaving a literal enumeration.
Second, literalism backfires. In s05 the Qwen and Codex versions associated “cleaning” with maid outfits and changed the girls’ clothes even though the scene specified none.
And s09 stays unstable under the strict brief too. Claude’s version brings back a blonde, blue-eyed, blue-ribboned girl — but she wears Kana’s ahoge over shortened hair, closer to a blonde Kana than a full Kei. Qwen’s version loses the blonde it had kept in freeform, and Codex stays broken. None of the six s09 variants produced a fully on-model Kei; keeping three trigger-only characters distinct while they perform the same action flips on tiny word-order changes.
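Part of the strict brief can be enforced mechanically before spending a generation on it. The sketch below checks only the rules a string match can decide: the count lock is present, present triggers appear, absent triggers stay out of the positive and go into the negative, and text and speech bubbles are banned. It cannot judge “every element of the scene is present” or “nothing was added”, which still need a human read. `koharuchan` is a hypothetical stand-in, since Koharu’s actual trigger word doesn’t appear in this article, and the function and message names are mine.

```python
import re

# From the brief's "bans on text, speech bubbles"; substring match, so plurals also count.
REQUIRED_NEGATIVE_TERMS = ("text", "speech bubble")


def has_word(text: str, word: str) -> bool:
    """True if `word` appears as a whole word, case-insensitively."""
    return re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE) is not None


def validate_strict(output: dict, present: set[str], absent: set[str],
                    count_lock: str) -> list[str]:
    """Return rule violations for one {"positive", "negative"} reply; [] means it passes."""
    pos, neg = output["positive"], output["negative"]
    problems: list[str] = []
    if count_lock.lower() not in pos.lower():
        problems.append("missing count lock")
    for trig in sorted(present):
        if not has_word(pos, trig):
            problems.append(f"present trigger missing: {trig}")
    for trig in sorted(absent):
        if has_word(pos, trig):
            problems.append(f"absent trigger in positive: {trig}")
        if not has_word(neg, trig):
            problems.append(f"absent trigger missing from negative: {trig}")
    for term in REQUIRED_NEGATIVE_TERMS:
        if term not in neg.lower():
            problems.append(f"negative lacks ban on: {term}")
    return problems
```

Passing this check only says the prompt follows the brief. As the next section shows, some failures happen no matter how clean the prompt is.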
Two failure modes survive any prompt
Across all 60 images, two kinds of breakage never got fixed by any LLM or either brief, so I attribute them to the Anima + trio LoRA side.
The first is exclusive possession. Instructions of the form “only one girl holds it” or “only one girl doesn’t” failed every single time: s03’s cones, s06’s goldfish bag and s10’s controllers all got distributed to everyone. In s06 the bag even migrates to the wrong girl.
Even in the much-improved strict s10, the final step (Kei staying empty-handed) broke in all three systems, putting a controller in the spectator’s hands. In the freeform s06 runs, too, every system handed bags to everyone. This matches the trio article’s cleaning-scene note that per-character action assignment is still weak with three characters at once.
The second is front/behind blocking. The s05 sneak-up-from-behind came out side-by-side in all six variants, and s08’s back-to-back weakens the same way. This lines up with the trio article’s boundary — stacked, separable contact works; spatial arrangements beyond it don’t. No amount of prompt rewording moved these two failure modes, which leaves ControlNet and inpainting as the remaining route.
No Qwen-to-Qwen affinity showed up
Across 20 scene conversions, not once did a Qwen3.7 Max prompt succeed where Claude’s and Codex’s failed in a way traceable to shared-family vocabulary. Every gap traced back to prompt structure — per-character description blocks, tag-level translation of composition, count locks — all of which any LLM can write.
The receiving text encoder being a 0.6B Qwen3 is, I think, the whole story. However Qwen-flavored the phrasing, a 0.6B encoder has no capacity to reward it; what reaches the image is only whether the prompt was short, concrete and structurally clean.
| System | Habit |
|---|---|
| Qwen3.7 Max | Writes the leanest tag lines. Clean images, but drops character descriptions and fine points like gaze and roles |
| Claude | Best at rescuing hard instructions (gaze translation, role retention), but also produced the one identity collapse and once returned two JSON objects in a single reply. Benefits most from the strict brief |
| Codex | Most stable character identity, the only freeform s10 clear. Never spectacular, never disastrous |