
Anima — A 2B Anime Image Generation Model Based on Cosmos-Predict2: Current State and Issues

Ikesan

I noticed ModelScope’s official account posting “Anima is Now Live on ModelScope!” so I looked into it. However, the post read like LLM-generated boilerplate — “roleplay specialist,” “zero persona-drift in long-form dialogue,” and so on — a complete mismatch with the actual model. In reality, it is a diffusion model that generates images from text.

Model Overview

| Item | Details |
|---|---|
| Developer | CircleStone Labs × Comfy Org |
| Parameters | 2 billion (2B) |
| Base model | NVIDIA Cosmos-Predict2-2B-Text2Image |
| Text encoder | Qwen3 0.6B base |
| VAE | Qwen Image VAE |
| Training data | Several million anime images + ~800K non-anime art (no synthetic data) |
| Knowledge cutoff | September 2025 (anime data) |
| VRAM | ~7GB (without quantization) |
| License | CircleStone Labs Non-Commercial License (non-commercial only) |
| Status | Preview (mid-training checkpoint) |
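The ~7GB figure is plausible for unquantized BF16 weights. As a back-of-envelope sketch (parameter counts from the table above; the VAE/activation overhead is my own rough assumption, not a published figure):

```python
# Rough VRAM estimate for Anima's components held in BF16 (2 bytes/param).
BYTES_PER_PARAM_BF16 = 2

dit_gb = 2.0e9 * BYTES_PER_PARAM_BF16 / 1e9      # 2B diffusion backbone -> 4.0 GB
encoder_gb = 0.6e9 * BYTES_PER_PARAM_BF16 / 1e9  # Qwen3 0.6B text encoder -> 1.2 GB
vae_and_overhead_gb = 1.5                        # VAE weights + activations (assumed)

total_gb = dit_gb + encoder_gb + vae_and_overhead_gb
print(f"{total_gb:.1f} GB")  # prints "6.7 GB"
```

That lands in the same ballpark as the advertised ~7GB, which suggests the figure describes running everything unquantized rather than with any offloading tricks.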

Architecture Highlights

What makes this interesting is that it is based on NVIDIA’s Cosmos-Predict2, not an SDXL derivative. It is an entirely different lineage from SDXL-based anime models (NoobAI, Illustrious, Animagine).

That said, the text encoder is Qwen3 0.6B — quite small. Even other lightweight image models typically pair with text encoders of around 4B, so this is a significant constraint.

| Setting | Value |
|---|---|
| Resolution | ~1MP (1024×1024, 896×1152, etc.) |
| Steps | 30–50 |
| CFG | 4–5 |
| Supported environment | ComfyUI (native) |

Prompt Format

Supports Danbooru tags, natural language, or a combination of both.

[quality tags] [1girl/1boy etc.] [character] [series] [artist] [general tags]

Artist specification requires the @artist_name prefix.
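For illustration, a prompt following this ordering might look like the example below. Every tag value here is a placeholder I chose to show the structure — the quality tags, character, and series are illustrative Danbooru-style tags, not values from the model card, and `@artist_name` is left as a placeholder.

```
masterpiece, best quality, 1girl, hatsune miku, vocaloid, @artist_name, smile, school uniform, outdoors, cherry blossoms
```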

User Reception (Immediately After Release)

113 likes on Civitai, 37 discussion threads on Hugging Face — initial interest was high.

Positives

  • Lightweight: 7GB without quantization. Runs on consumer GPUs
  • Natural language prompts: Usable even without knowing Danbooru tags
  • New architecture: First anime model based on Cosmos-Predict2
  • LoRA training confirmed: Trainable at rank 32, 512px, 10GB VRAM

Issues

  • Slow inference: Reports of being 10x slower than SDXL on Tesla V100
  • Hands break: Especially prominent when using @artist_name tags
  • Weak text encoder: 0.6B cannot understand complex pose or composition instructions. Some point out it can only produce poses that exist as Danbooru tags
  • Default output is bland: No aesthetic tuning has been applied, so output without quality tags or artist specification looks flat
  • Weak at high resolution: A limitation of the preview version
  • No ControlNet support: The ecosystem does not exist yet
  • Poor at text rendering: Single words may appear, but sentences are impossible

Comparison with Existing Models

| Item | Anima | NoobAI-XL | Illustrious-XL | Z-Image |
|---|---|---|---|---|
| Architecture | Cosmos-Predict2 | SDXL derivative | SDXL derivative | S3-DiT |
| Parameters | 2B | SDXL equivalent | SDXL equivalent | 6B |
| Maturity | Preview | Stable | Stable | Stable |
| VRAM | ~7GB | 6–8GB | 6–8GB | ~20GB (BF16) |
| Speed | Slow | SDXL standard | SDXL standard | Fast (Turbo version available) |
| ControlNet | Not supported | Extensive | Extensive | Supported |
| LoRA ecosystem | Almost none | Massive | Massive | Growing |
| License | Non-commercial only | Open | Open | Apache 2.0 |

Two key takeaways:

  1. Speed and ecosystem: SDXL-based NoobAI/Illustrious are fully mature with an incomparably large accumulation of LoRAs, ControlNets, and merged models. There is no reason to migrate to Anima right now
  2. Text encoder limitations: Z-Image and FLUX.2 Klein are lightweight but pack sufficiently sized text encoders. 0.6B is fundamentally lacking in expressiveness

The Text Encoder Problem in Detail

A detailed Japanese technical review (dskjal.com) covers this well. Key points:

  • Anima “can only output poses that exist as Danbooru tags”
  • Writing “raising arms and looking left” in natural language does not reproduce compositions that do not exist as tags
  • FLUX.2 Klein and Z-Image, while also lightweight, can understand such free-form instructions to some degree

Although Anima advertises natural language prompt support, the text encoder lacks the capacity, effectively constraining generation to tag-based output.

Treatment on ModelScope

A model page exists on ModelScope and file downloads are available. However, no inference API, demo, or deployment features are provided. The “roleplay specialist” phrasing in ModelScope’s tweet is entirely off-base for an image generation model — most likely a template promotional blurb attached without verifying the model’s actual nature.


Conclusion

The direction of an anime-focused model on a new architecture is interesting. If more image generation models based on Cosmos-Predict2 emerge in the future, Anima will have meaning as a pioneer.

However, as its “preview” label suggests, it currently has almost no practical advantages over SDXL-based models. Inference is 10x slower, hands break, the text encoder is weak, the ecosystem is nonexistent, and the license is non-commercial only. ComfyUI native support is a plus, but NoobAI and Illustrious run just fine on ComfyUI too.

It all depends on how much improves in the final release. Unless the inference speed and text encoder constraints are resolved, there is no motivation to migrate from existing models.