LingBot-World: Ant Group's open-source real-time world model
On January 29, 2026, Robbyant, an embodied intelligence company under Ant Group, open-sourced LingBot-World, a world model based on Wan 2.2 that generates interactive video in real time from a single image.
What a world model is
Conventional video generators such as Sora, Kling, and Wan are designed to “make pretty videos.” A world model, on the other hand, is designed to “simulate the world.” It learns physics and causal relationships.
| | Conventional video generation AI | World model |
|---|---|---|
| Core idea | Make pretty videos | Simulate the world |
| Physics | Can break for visual quality | Prioritizes physical consistency |
| Output | Finished video file | Interactive video stream |
| Main use | Content production | AI training data generation, simulation |
LingBot-World is designed as a “digital sandbox” for embodied intelligence, robotics, autonomous driving, and game development.
Technical features
Real-time interaction
Generation throughput is about 16 FPS, with end-to-end latency under 1 second. While it is generating, you can control the character or camera with the keyboard and mouse. You can also change weather and style through text input.
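As a rough sanity check on those numbers: a sustained 16 FPS stream leaves about 62.5 ms to produce each frame, and a sub-second end-to-end latency corresponds to fewer than 16 frames of lag between an input and its on-screen effect. A quick back-of-the-envelope sketch (the figures come from the specs above; the function names are just illustrative):

```python
# Back-of-the-envelope check on the real-time interaction specs.

def frame_budget_ms(fps: float) -> float:
    """Time available to generate one frame at a given frame rate."""
    return 1000.0 / fps

def lag_frames(latency_s: float, fps: float) -> float:
    """How many frames elapse during a given end-to-end latency."""
    return latency_s * fps

print(frame_budget_ms(16))   # 62.5 ms per frame
print(lag_frames(1.0, 16))   # at most 16 frames of input-to-screen lag
```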
In my recent roundup of video generation AI, I wrote that image-to-video workflows often leave the in-between motion up to the AI. A world model can control that motion directly through real-time interaction.
Long-term consistency
It can generate continuous, stable video for up to about 10 minutes without quality degradation, achieved through multi-stage training and parallelized acceleration. Conventional image-to-video (i2v) models usually top out at anywhere from a few seconds to around 16 seconds, so this is a major step forward.
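To put the 10-minute figure in context: at the model's roughly 16 FPS, a 10-minute stream is 9,600 generated frames, versus 256 frames for a 16-second clip from a conventional i2v model. Illustrative arithmetic only, not an official benchmark:

```python
FPS = 16  # generation throughput reported above

def total_frames(seconds: float, fps: int = FPS) -> int:
    """Number of frames generated over a clip of the given length."""
    return int(seconds * fps)

print(total_frames(10 * 60))  # 9600 frames for a 10-minute stream
print(total_frames(16))       # 256 frames for a typical i2v clip
```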
Zero-shot generalization
It can generate video from a single real-world image or a single game screenshot without extra fine-tuning. Since it does not require scene-specific data collection, it can be used immediately in new environments.
The LingBot lineup
LingBot-World joins three earlier models in Ant Group's LingBot series. It is positioned as an extension of the AGI strategy from the digital domain toward physical perception.
| Model | Role |
|---|---|
| LingBot-Depth | Spatial understanding |
| LingBot-VLA | Vision-Language-Action model, the robot’s “general brain” |
| LingBot-VA | Causal video-action model that predicts the future before acting |
| LingBot-World | World model, the model covered here |
The intended use is to generate a virtual environment with LingBot-World before a robot moves in the real world and to use that environment for training data.
Apple Silicon support
At the moment, it is not supported. The official setup assumes NVIDIA GPUs, and the sample command uses an 8-GPU configuration (`--nproc_per_node=8`).
The base Wan 2.2 model itself has been shown to run on Apple Silicon in practice, such as on an M2 Max with 64GB or an M3 Max with 36GB. But:
- You need GGUF format, because safetensors hit MPS compatibility errors
- You need to adjust the workflow through ComfyUI
- It is unclear whether LingBot-World’s own extensions, such as interactivity and long-form generation, work on MPS
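If you want to probe the situation on your own machine, PyTorch exposes backend checks that make the device picture explicit. A minimal sketch; the selection logic is mine, not part of LingBot-World, which is not confirmed to run on MPS:

```python
# Minimal device probe: prefers CUDA (what LingBot-World's setup assumes),
# falls back to MPS (Apple Silicon), then CPU. Illustrative only.

def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```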
Realistic options
| Option | Difficulty | Notes |
|---|---|---|
| Use Wan 2.2 and compromise | Low | Run GGUF in ComfyUI; an M1 Max with 64GB should be fine |
| Use cloud GPUs | Low | Rent NVIDIA GPUs from RunPod, Vast.ai, and similar providers |
| Wait for official MPS support | - | Roadmap unknown |
If you only want video generation and do not care about interactivity, running Wan 2.2 locally is more realistic. If you want to try LingBot-World’s real-time controls, you will need cloud GPUs.
Where to find it
- GitHub - Robbyant/lingbot-world
- Hugging Face - robbyant/lingbot-world-base-cam
- arXiv paper
- License: Apache 2.0