
Anima — A 2B Anime Image Generation Model Based on Cosmos-Predict2: Current State and Issues

Ikesan

I noticed ModelScope’s official account posting “Anima is Now Live on ModelScope!” so I looked into it. However, the post read like LLM-generated boilerplate — “roleplay specialist,” “zero persona-drift in long-form dialogue,” and so on — a complete mismatch with the actual model. In reality, it is a diffusion model that generates images from text.

Model Overview

| Item | Details |
|---|---|
| Developer | CircleStone Labs × Comfy Org |
| Parameters | 2 billion (2B) |
| Base model | NVIDIA Cosmos-Predict2-2B-Text2Image |
| Text encoder | Qwen3 0.6B base |
| VAE | Qwen Image VAE |
| Training data | Several million anime images + ~800K non-anime art (no synthetic data) |
| Knowledge cutoff | September 2025 (anime data) |
| VRAM | ~7GB (without quantization) |
| License | CircleStone Labs Non-Commercial License (non-commercial only) |
| Status | Preview (mid-training checkpoint) |
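The ~7GB figure is plausible for unquantized BF16 weights. As a back-of-envelope sketch (parameter counts from the table above; the VAE/activation overhead is my own rough assumption, not a published figure):

```python
# Rough VRAM estimate for Anima's components held in BF16 (2 bytes/param).
BYTES_PER_PARAM_BF16 = 2

dit_gb = 2.0e9 * BYTES_PER_PARAM_BF16 / 1e9      # 2B diffusion backbone -> 4.0 GB
encoder_gb = 0.6e9 * BYTES_PER_PARAM_BF16 / 1e9  # Qwen3 0.6B text encoder -> 1.2 GB
vae_and_overhead_gb = 1.5                        # VAE weights + activations (assumed)

total_gb = dit_gb + encoder_gb + vae_and_overhead_gb
print(f"{total_gb:.1f} GB")  # prints "6.7 GB"
```

That lands in the same ballpark as the advertised ~7GB, which suggests the figure describes running everything unquantized rather than with any offloading tricks.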

Architecture Highlights

What makes this interesting is that it is based on NVIDIA’s Cosmos-Predict2, not an SDXL derivative. It is an entirely different lineage from SDXL-based anime models (NoobAI, Illustrious, Animagine).

That said, the text encoder is Qwen3 0.6B — quite small. Even other lightweight image models typically pair with text encoders of around 4B, so this is a significant constraint.

| Setting | Value |
|---|---|
| Resolution | ~1MP (1024×1024, 896×1152, etc.) |
| Steps | 30–50 |
| CFG | 4–5 |
| Supported environment | ComfyUI (native) |

Prompt Format

Supports Danbooru tags, natural language, or a combination of both.

[quality tags] [1girl/1boy etc.] [character] [series] [artist] [general tags]

Artist specification requires the @artist_name prefix.
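For illustration, a prompt following this ordering might look like the example below. Every tag value here is a placeholder I chose to show the structure — the quality tags, character, and series are illustrative Danbooru-style tags, not values from the model card, and `@artist_name` is left as a placeholder.

```
masterpiece, best quality, 1girl, hatsune miku, vocaloid, @artist_name, smile, school uniform, outdoors, cherry blossoms
```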

User Reception (Immediately After Release)

113 likes on Civitai, 37 discussion threads on Hugging Face — initial interest was high.

Positives

  • Lightweight: 7GB without quantization. Runs on consumer GPUs
  • Natural language prompts: Usable even without knowing Danbooru tags
  • New architecture: First anime model based on Cosmos-Predict2
  • LoRA training confirmed: Trainable at rank 32, 512px, 10GB VRAM

Issues

  • Slow inference: Reports of being 10x slower than SDXL on Tesla V100
  • Hands break: Especially prominent when using @artist_name tags
  • Weak text encoder: 0.6B cannot understand complex pose or composition instructions. Some point out it can only produce poses that exist as Danbooru tags
  • Default output is bland: No aesthetic tuning has been applied, so output without quality tags or artist specification looks flat
  • Weak at high resolution: A limitation of the preview version
  • No ControlNet support: The ecosystem does not exist yet
  • Poor at text rendering: Single words may appear, but sentences are impossible

Comparison with Existing Models

| Item | Anima | NoobAI-XL | Illustrious-XL | Z-Image |
|---|---|---|---|---|
| Architecture | Cosmos-Predict2 | SDXL derivative | SDXL derivative | S3-DiT |
| Parameters | 2B | SDXL equivalent | SDXL equivalent | 6B |
| Maturity | Preview | Stable | Stable | Stable |
| VRAM | ~7GB | 6–8GB | 6–8GB | ~20GB (BF16) |
| Speed | Slow | SDXL standard | SDXL standard | Fast (Turbo version available) |
| ControlNet | Not supported | Extensive | Extensive | Supported |
| LoRA ecosystem | Almost none | Massive | Massive | Growing |
| License | Non-commercial only | Open | Open | Apache 2.0 |

Two key takeaways:

  1. Speed and ecosystem: SDXL-based NoobAI/Illustrious are fully mature with an incomparably large accumulation of LoRAs, ControlNets, and merged models. There is no reason to migrate to Anima right now
  2. Text encoder limitations: Z-Image and FLUX.2 Klein are lightweight but pack sufficiently sized text encoders. 0.6B is fundamentally lacking in expressiveness

The Text Encoder Problem in Detail

A detailed Japanese technical review (dskjal.com) covers this well. Key points:

  • Anima “can only output poses that exist as Danbooru tags”
  • Writing “raising arms and looking left” in natural language does not reproduce compositions that do not exist as tags
  • FLUX.2 Klein and Z-Image, while also lightweight, can understand such free-form instructions to some degree

Although Anima advertises natural language prompt support, the text encoder lacks the capacity, effectively constraining generation to tag-based output.

Treatment on ModelScope

A model page exists on ModelScope and file downloads are available. However, no inference API, demo, or deployment features are provided. The “roleplay specialist” phrasing in ModelScope’s tweet is entirely off-base for an image generation model — most likely a template promotional blurb attached without verifying the model’s actual nature.


Conclusion

The direction of an anime-focused model on a new architecture is interesting. If more image generation models based on Cosmos-Predict2 emerge in the future, Anima will have meaning as a pioneer.

However, as its “preview” label suggests, it currently has almost no practical advantages over SDXL-based models. Inference is 10x slower, hands break, the text encoder is weak, the ecosystem is nonexistent, and the license is non-commercial only. ComfyUI native support is a plus, but NoobAI and Illustrious run just fine on ComfyUI too.

It all depends on how much improves in the final release. Unless the inference speed and text encoder constraints are resolved, there is no motivation to migrate from existing models.