LingBot-World: Ant Group's open-source real-time world model
On January 29, 2026, Robbyant, an embodied intelligence company under Ant Group, open-sourced LingBot-World, a world model based on Wan 2.2 that generates interactive video in real time from a single image.
What a world model is
Conventional video generators such as Sora, Kling, and Wan are designed to “make pretty videos.” A world model, on the other hand, is designed to “simulate the world.” It learns physics and causal relationships.
| | Conventional video generation AI | World model |
|---|---|---|
| Core idea | Make pretty videos | Simulate the world |
| Physics | Can break for visual quality | Prioritizes physical consistency |
| Output | Finished video file | Interactive video stream |
| Main use | Content production | AI training data generation, simulation |
LingBot-World is designed as a “digital sandbox” for embodied intelligence, robotics, autonomous driving, and game development.
Technical features
Real-time interaction
Generation throughput is about 16 FPS, with end-to-end latency under 1 second. While it is generating, you can control the character or camera with the keyboard and mouse. You can also change weather and style through text input.
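As a rough sanity check on those numbers: a sustained 16 FPS stream leaves about 62.5 ms to produce each frame, and a sub-second end-to-end latency corresponds to fewer than 16 frames of lag between an input and its on-screen effect. A quick back-of-the-envelope sketch (the figures come from the specs above; the function names are just illustrative):

```python
# Back-of-the-envelope check on the real-time interaction specs.

def frame_budget_ms(fps: float) -> float:
    """Time available to generate one frame at a given frame rate."""
    return 1000.0 / fps

def lag_frames(latency_s: float, fps: float) -> float:
    """How many frames elapse during a given end-to-end latency."""
    return latency_s * fps

print(frame_budget_ms(16))   # 62.5 ms per frame
print(lag_frames(1.0, 16))   # at most 16 frames of input-to-screen lag
```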
In my recent roundup of video generation AI, I wrote that image-to-video workflows often leave the in-between motion up to the AI. A world model can control that motion directly through real-time interaction.
Long-term consistency
It can generate continuous, stable video for up to about 10 minutes without quality degradation, achieved through multi-stage training and parallelized acceleration. Conventional image-to-video (i2v) models usually top out at anywhere from a few seconds to around 16 seconds, so this is a major step forward.
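To put the 10-minute figure in context: at the model's roughly 16 FPS, a 10-minute stream is 9,600 generated frames, versus 256 frames for a 16-second clip from a conventional i2v model. Illustrative arithmetic only, not an official benchmark:

```python
FPS = 16  # generation throughput reported above

def total_frames(seconds: float, fps: int = FPS) -> int:
    """Number of frames generated over a clip of the given length."""
    return int(seconds * fps)

print(total_frames(10 * 60))  # 9600 frames for a 10-minute stream
print(total_frames(16))       # 256 frames for a typical i2v clip
```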
Zero-shot generalization
It can generate video from a single real-world image or a single game screenshot without extra fine-tuning. Since it does not require scene-specific data collection, it can be used immediately in new environments.
The LingBot lineup
LingBot-World joins three earlier models in Ant Group's LingBot series. It is positioned as an extension of the AGI strategy from the digital domain toward physical perception.
| Model | Role |
|---|---|
| LingBot-Depth | Spatial understanding |
| LingBot-VLA | Vision-Language-Action model, the robot’s “general brain” |
| LingBot-VA | Causal video-action model that predicts the future before acting |
| LingBot-World | World model, the model covered here |
The intended use is to generate a virtual environment with LingBot-World before a robot moves in the real world and to use that environment for training data.
Apple Silicon support
At the moment, it is not supported. The official setup assumes NVIDIA GPUs, and the sample command uses an 8-GPU configuration (`--nproc_per_node=8`).
The base Wan 2.2 model itself has been shown to run on Apple Silicon in practice, such as on an M2 Max with 64GB or an M3 Max with 36GB. But:
- You need GGUF format, because safetensors hit MPS compatibility errors
- You need to adjust the workflow through ComfyUI
- It is unclear whether LingBot-World’s own extensions, such as interactivity and long-form generation, work on MPS
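If you want to probe the situation on your own machine, PyTorch exposes backend checks that make the device picture explicit. A minimal sketch; the selection logic is mine, not part of LingBot-World, which is not confirmed to run on MPS:

```python
# Minimal device probe: prefers CUDA (what LingBot-World's setup assumes),
# falls back to MPS (Apple Silicon), then CPU. Illustrative only.

def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```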
Realistic options
| Option | Difficulty | Notes |
|---|---|---|
| Use Wan 2.2 and compromise | Low | Run GGUF in ComfyUI; an M1 Max with 64GB should be fine |
| Use cloud GPUs | Low | Rent NVIDIA GPUs from RunPod, Vast.ai, and similar providers |
| Wait for official MPS support | - | Roadmap unknown |
If you only want video generation and do not care about interactivity, running Wan 2.2 locally is more realistic. If you want to try LingBot-World’s real-time controls, you will need cloud GPUs.
Where to find it
- GitHub - Robbyant/lingbot-world
- Hugging Face - robbyant/lingbot-world-base-cam
- arXiv paper
- License: Apache 2.0