AnimeGamer: AI That Understands Game State to Generate Anime Videos

AnimeGamer, jointly developed by Tencent ARC Lab and City University of Hong Kong, is a model specialized in generating video for anime-style game simulation. The research was accepted to ICCV 2025 and takes a fundamentally different approach from general-purpose video generators like Sora and Runway.

What makes it different

Typical video generation models are general-purpose tools that turn a text prompt into a video. AnimeGamer, by contrast, generates video while keeping track of game state, so each clip reflects what has already happened to the character.

Aspect	AnimeGamer	Sora / Runway and similar models
Focus area	Life simulation for anime-style games	General-purpose video generation
State tracking	Integrates game-state prediction	None
Context handling	Explicitly incorporates previous frames	Weaker continuity between frames

Demo videos featuring Studio Ghibli characters

The published demos use characters from the following Studio Ghibli works:

Ponyo: a clip of Sosuke exploring a game world
Kiki’s Delivery Service: a clip of Kiki flying through the sky
Castle in the Sky: a clip featuring Pazu

One interesting detail is that it can generate videos where characters from different anime works appear in the same world. For example, it can produce a clip where Pazu flies on Kiki’s broom.

That said, this is not an interactive real-time game. You provide instructions in text, and the system generates matching video clips.

Character state management

Like a game system, AnimeGamer gives characters multiple status values:

Stamina: consumed by actions and recovered through rest
Social value: changes through interactions with other characters
Entertainment value: rises through enjoyable activities

These values affect how the simulation progresses and help maintain consistency.
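
To make the idea concrete, here is a minimal sketch of how three status values like these could be stored and updated. It is an illustration only, not AnimeGamer's code: the class name, the 0 to 100 range, the starting values, and the delta amounts are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed range for illustration; the article does not state the real scale.
STAT_MIN, STAT_MAX = 0, 100


def clamp(value: int) -> int:
    """Keep a status value inside the assumed [STAT_MIN, STAT_MAX] range."""
    return max(STAT_MIN, min(STAT_MAX, value))


@dataclass(frozen=True)
class CharacterState:
    """Hypothetical container for the three status values named above."""
    stamina: int = 50
    social: int = 50
    entertainment: int = 50

    def apply(self, d_stamina: int = 0, d_social: int = 0,
              d_entertainment: int = 0) -> "CharacterState":
        """Return a new state with the deltas applied and clamped."""
        return CharacterState(
            stamina=clamp(self.stamina + d_stamina),
            social=clamp(self.social + d_social),
            entertainment=clamp(self.entertainment + d_entertainment),
        )


state = CharacterState()
# A tiring but enjoyable activity: stamina drops, entertainment rises.
state = state.apply(d_stamina=-20, d_entertainment=15)
# Resting recovers stamina; the clamp stops it at the assumed maximum.
state = state.apply(d_stamina=70)
print(state)
```

Returning a new immutable state on every update makes it easy to keep one snapshot per turn, which is handy when later turns need to refer back to earlier ones.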

Technical features

Next Game State Prediction

The model generates both character motion and state transitions at the same time. In an RPG-style setting, for example, it can track position, HP, and action state while generating the corresponding animation.
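
The key point is that a single step yields both the next state and the matching animation. The toy sketch below shows only that shape. In place of the learned model it uses a few hand-written rules, and every name in it (GameState, TurnOutput, toy_next_turn), the keyword matching, and the HP cap of 100 are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GameState:
    """Hypothetical RPG-style state: position, HP, and current action."""
    x: int
    y: int
    hp: int
    action: str


@dataclass(frozen=True)
class TurnOutput:
    """One turn produces the next state and the animation together."""
    next_state: GameState
    animation: str  # placeholder; a real system would emit video, not a string


def toy_next_turn(state: GameState, instruction: str) -> TurnOutput:
    """Hand-written stand-in for a learned next-state predictor."""
    # Match whole words so that "forest" is not mistaken for "rest".
    words = instruction.lower().split()
    if "rest" in words:
        nxt = replace(state, hp=min(100, state.hp + 10), action="resting")
    elif "walk" in words:
        nxt = replace(state, x=state.x + 1, action="walking")
    else:
        nxt = replace(state, action="idle")
    clip = f"<clip: {nxt.action} at ({nxt.x}, {nxt.y})>"
    return TurnOutput(next_state=nxt, animation=clip)


state = GameState(x=0, y=0, hp=95, action="idle")
for instruction in ["walk to the forest", "rest under a tree"]:
    out = toy_next_turn(state, instruction)
    state = out.next_state  # the predicted state feeds the next turn
    print(out.animation, "hp =", state.hp)
```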

Maintaining contextual consistency

Past frame information is explicitly incorporated to preserve scene-level continuity. This helps reduce a common failure mode in video generation, where a character’s appearance changes midway through the clip. Elements like a purple car or a forest background stay consistent across multiple turns.

Using a multimodal LLM

The system combines text prompts with visual context from previous frames to recognize and generate actions. It is not plain text-to-video: each clip is generated with the game-like context of earlier turns taken into account.
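
One way to picture how text and earlier frames come together over several turns is a bounded history that is packed into each new request. The sketch below is a guess at the general pattern, not the project's actual interface: the Turn and Session classes, the dictionary keys, and the history length are hypothetical, and the strings such as "clip_0" stand in for real visual features.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    instruction: str  # text the user gave for this turn
    frame_ref: str    # placeholder for visual features of the generated clip


class Session:
    """Hypothetical multi-turn session with a bounded history window."""

    def __init__(self, max_history: int = 3) -> None:
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.history: deque = deque(maxlen=max_history)

    def build_input(self, instruction: str) -> dict:
        """Combine the new text prompt with context from earlier turns."""
        return {
            "text": instruction,
            "visual_context": [t.frame_ref for t in self.history],
            "past_instructions": [t.instruction for t in self.history],
        }

    def record(self, instruction: str, frame_ref: str) -> None:
        """Store a finished turn so later turns can condition on it."""
        self.history.append(Turn(instruction, frame_ref))


session = Session(max_history=2)
session.record("Sosuke drives a purple car", "clip_0")
session.record("Sosuke stops near the forest", "clip_1")
session.record("Sosuke gets out of the car", "clip_2")

model_input = session.build_input("Sosuke walks into the forest")
print(model_input["visual_context"])
```

With a window of two, the oldest turn falls out of the context, so the size of the window is a trade-off between memory cost and how far back details like the purple car can be kept consistent.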

Trying it yourself

The code is available on GitHub, and you can run a local Gradio demo. The hardware requirements are fairly heavy, but once the environment is set up, the demo is launched with:

python app.py