Anima — A 2B Anime Image Generation Model Based on Cosmos-Predict2: Current State and Issues
I noticed ModelScope’s official account posting “Anima is Now Live on ModelScope!” so I looked into it. However, the post’s copy read like boilerplate written for an LLM — “roleplay specialist,” “zero persona-drift in long-form dialogue,” and so on — completely mismatched with the actual model. In reality, it is a diffusion model that generates images from text.
Model Overview
| Item | Details |
|---|---|
| Developer | CircleStone Labs × Comfy Org |
| Parameters | 2 billion (2B) |
| Base model | NVIDIA Cosmos-Predict2-2B-Text2Image |
| Text encoder | Qwen3 0.6B base |
| VAE | Qwen Image VAE |
| Training data | Several million anime images + ~800K non-anime art (no synthetic data) |
| Knowledge cutoff | September 2025 (anime data) |
| VRAM | ~7GB (without quantization) |
| License | CircleStone Labs Non-Commercial License (non-commercial only) |
| Status | Preview (mid-training checkpoint) |
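As a rough sanity check on the ~7GB VRAM figure, here is a back-of-envelope estimate assuming BF16 weights (2 bytes per parameter) and the component sizes from the table above; the remaining headroom is activations, the VAE, and framework overhead:

```python
def bf16_gb(params_billion: float) -> float:
    """Approximate weight memory in GB for BF16 storage (2 bytes/param)."""
    return params_billion * 1e9 * 2 / 1024**3

dit = bf16_gb(2.0)        # 2B diffusion backbone -> ~3.7 GB
text_enc = bf16_gb(0.6)   # Qwen3 0.6B text encoder -> ~1.1 GB
print(f"weights total: {dit + text_enc:.1f} GB of the ~7 GB budget")
```

The gap between ~4.8GB of weights and the reported ~7GB is consistent with the usual overhead of the VAE, activations, and CUDA allocator.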
Architecture Highlights
What makes this interesting is that it is based on NVIDIA’s Cosmos-Predict2, not an SDXL derivative. It is an entirely different lineage from SDXL-based anime models (NoobAI, Illustrious, Animagine).
That said, the text encoder is Qwen3 0.6B — quite small. Even lightweight image models typically ship text encoders of around 4B parameters, so this is a significant constraint.
Recommended Settings
| Setting | Value |
|---|---|
| Resolution | ~1MP (1024×1024, 896×1152, etc.) |
| Steps | 30–50 |
| CFG | 4–5 |
| Supported environment | ComfyUI (native) |
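The “around 1MP” resolution guidance can be checked programmatically. A small helper (hypothetical, not part of any Anima or ComfyUI tooling) that validates candidate sizes against the recommended settings:

```python
RECOMMENDED = {"steps": (30, 50), "cfg": (4.0, 5.0)}  # ranges from the table above

def is_near_1mp(width: int, height: int, tol: float = 0.15) -> bool:
    """True if width*height is within tol of 1 megapixel (1024*1024)."""
    target = 1024 * 1024
    return abs(width * height - target) / target <= tol

for w, h in [(1024, 1024), (896, 1152), (1280, 1280)]:
    print(w, h, is_near_1mp(w, h))  # the first two pass, 1280x1280 does not
```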
Prompt Format
Supports Danbooru tags, natural language, or a combination of both.
[quality tags] [1girl/1boy etc.] [character] [series] [artist] [general tags]
Artist specification requires the @artist_name prefix.
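The tag ordering above can be captured in a small helper. This is a sketch — the function name and defaults are illustrative, not from any official tooling — but the slot order and the `@` artist prefix follow the format described above:

```python
def build_prompt(quality, subject, character="", series="", artists=(), tags=()):
    """Assemble a prompt in the order: quality, subject, character,
    series, artist, general tags. Artist names get the @ prefix."""
    parts = [quality, subject, character, series]
    parts += [f"@{a}" for a in artists]   # @artist_name prefix required by Anima
    parts += list(tags)
    return ", ".join(p for p in parts if p)

print(build_prompt("masterpiece, best quality", "1girl",
                   character="hatsune miku", series="vocaloid",
                   artists=["some_artist"], tags=["smile", "outdoors"]))
# → masterpiece, best quality, 1girl, hatsune miku, vocaloid, @some_artist, smile, outdoors
```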
User Reception (Immediately After Release)
113 likes on Civitai, 37 discussion threads on Hugging Face — initial interest was high.
Positives
- Lightweight: 7GB without quantization. Runs on consumer GPUs
- Natural language prompts: Usable even without knowing Danbooru tags
- New architecture: First anime model based on Cosmos-Predict2
- LoRA training confirmed: Trainable at rank 32, 512px, 10GB VRAM
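For context on why rank-32 training fits in 10GB: a LoRA adapter on a linear layer of shape (d_out, d_in) adds only rank × (d_in + d_out) trainable parameters. A quick estimate — the layer dimensions below are assumptions for illustration, not Anima’s actual shapes:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params LoRA adds to one linear layer:
    A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical transformer projection (NOT Anima's real dims): hidden size 2048
hidden = 2048
print(lora_params(hidden, hidden, rank=32))  # 131072 params
print(hidden * hidden)                        # vs ~4.2M in the full layer
```

At rank 32 the adapter is roughly 3% of each projection layer, which is why gradients and optimizer state stay small enough for a 10GB card at 512px.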
Issues
- Slow inference: Reports of being 10x slower than SDXL on Tesla V100
- Hands break: Especially prominent when using @artist_name tags
- Weak text encoder: 0.6B cannot understand complex pose or composition instructions. Some point out it can only produce poses that exist as Danbooru tags
- Default output is bland: No aesthetic tuning has been applied, so output without quality tags or artist specification looks flat
- Weak at high resolution: A limitation of the preview version
- No ControlNet support: The ecosystem does not exist yet
- Poor at text rendering: Single words may appear, but sentences are impossible
Comparison with Existing Models
| Item | Anima | NoobAI-XL | Illustrious-XL | Z-Image |
|---|---|---|---|---|
| Architecture | Cosmos-Predict2 | SDXL derivative | SDXL derivative | S3-DiT |
| Parameters | 2B | SDXL equivalent | SDXL equivalent | 6B |
| Maturity | Preview | Stable | Stable | Stable |
| VRAM | ~7GB | 6–8GB | 6–8GB | ~20GB (BF16) |
| Speed | Slow | SDXL standard | SDXL standard | Fast (Turbo version available) |
| ControlNet | Not supported | Extensive | Extensive | Supported |
| LoRA ecosystem | Almost none | Massive | Massive | Growing |
| License | Non-commercial only | Open | Open | Apache 2.0 |
Two key takeaways:
- Speed and ecosystem: SDXL-based NoobAI/Illustrious are fully mature with an incomparably large accumulation of LoRAs, ControlNets, and merged models. There is no reason to migrate to Anima right now
- Text encoder limitations: Z-Image and FLUX.2 Klein are lightweight but pack sufficiently sized text encoders. 0.6B is fundamentally lacking in expressiveness
The Text Encoder Problem in Detail
A detailed Japanese technical review (dskjal.com) covers this well. Key points:
- Anima “can only output poses that exist as Danbooru tags”
- Natural-language instructions like “raising arms and looking left” are not reproduced unless the composition exists as a tag
- FLUX.2 Klein and Z-Image, while also lightweight, can understand such free-form instructions to some degree
Although Anima advertises natural language prompt support, the text encoder lacks the capacity, effectively constraining generation to tag-based output.
Treatment on ModelScope
A model page exists on ModelScope and file downloads are available. However, no inference API, demo, or deployment features are provided. The “roleplay specialist” phrasing in ModelScope’s tweet is entirely off-base for an image generation model — most likely a template promotional blurb attached without verifying the model’s actual nature.
Summary
The direction of an anime-focused model on a new architecture is interesting. If more image generation models based on Cosmos-Predict2 emerge in the future, Anima will have meaning as a pioneer.
However, as its “preview” label suggests, it currently has almost no practical advantage over SDXL-based models. Inference is reportedly 10x slower, hands break, the text encoder is weak, the ecosystem is nonexistent, and the license is non-commercial only. ComfyUI native support is a plus, but NoobAI and Illustrious run just fine on ComfyUI too.
Everything depends on how much improves in the final release. Unless the inference-speed and text-encoder constraints are resolved, there is little motivation to migrate from existing models.
Related Articles
- Z-Image — Alibaba’s Image Generation AI Said to Surpass FLUX
- BEYOND_REALITY_Z_IMAGE — A Photorealistic Portrait Model Based on Z-Image Turbo
- FLUX.2 Klein — A 9B Parameter Lightweight Image Generation Model and Apple Silicon Compatibility
- Reproducing NovelAI’s Precise Reference Locally