
LingBot-World: Ant Group's open-source real-time world model


On January 29, 2026, Robbyant, an embodied intelligence company under Ant Group, open-sourced LingBot-World. It is a world model based on Wan 2.2 that can generate interactive video in real time from a single image.

What a world model is

Conventional video generators such as Sora, Kling, and Wan are designed to “make pretty videos.” A world model, on the other hand, is designed to “simulate the world.” It learns physics and causal relationships.

| | Conventional video generation AI | World model |
| --- | --- | --- |
| Core idea | Make pretty videos | Simulate the world |
| Physics | Can break for visual quality | Prioritizes physical consistency |
| Output | Finished video file | Interactive video stream |
| Main use | Content production | AI training data generation, simulation |

LingBot-World is designed as a “digital sandbox” for embodied intelligence, robotics, autonomous driving, and game development.

Technical features

Real-time interaction

Generation throughput is about 16 FPS, with end-to-end latency under 1 second. While it is generating, you can control the character or camera with the keyboard and mouse. You can also change weather and style through text input.
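To make that interaction model concrete, here is a minimal sketch of what a real-time control loop could look like. The `WorldModel` interface, its `step` method, and the input/display helpers are assumptions for illustration, not the actual LingBot-World API.

```python
import time

TARGET_FPS = 16
FRAME_BUDGET = 1.0 / TARGET_FPS

def run_session(model, first_frame, get_user_action, show_frame):
    """Step a world model once per frame while feeding in live user input."""
    frame = first_frame
    while True:
        start = time.monotonic()
        action = get_user_action()           # keyboard/mouse state, or a text prompt for weather/style
        frame = model.step(frame, action)    # hypothetical: next frame conditioned on the action
        show_frame(frame)
        elapsed = time.monotonic() - start
        if elapsed < FRAME_BUDGET:           # hold roughly 16 FPS if generation finishes early
            time.sleep(FRAME_BUDGET - elapsed)
```

The point of the loop is that the action is sampled fresh every frame, which is what lets keyboard and mouse input change the video as it is being generated.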

In my recent roundup of video generation AI, I wrote that image-to-video workflows often leave the in-between motion up to the AI. With a world model, you can steer that motion yourself through real-time interaction.

Long-term consistency

It can generate continuous, stable video for up to about 10 minutes without quality degradation, a result of multi-stage training and parallelized acceleration. Traditional i2v models usually top out at somewhere between a few seconds and around 16 seconds, so this is a major step forward.
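The write-up attributes the long horizon to training and acceleration; at inference time, the usual pattern for producing minutes of video is chunked autoregressive rollout over a sliding context window. The sketch below shows that generic pattern only; the `generate` call and its parameters are placeholders, not LingBot-World's documented interface.

```python
from collections import deque

def rollout(model, first_frame, num_frames, context_len=32, chunk=8):
    """Generate a long clip chunk by chunk, conditioning each chunk on the
    most recent `context_len` frames so the scene stays consistent."""
    context = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    while len(frames) < num_frames:
        new_frames = model.generate(list(context), n=chunk)  # hypothetical chunked generation call
        frames.extend(new_frames)
        context.extend(new_frames)
    return frames[:num_frames]
```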

Zero-shot generalization

It can generate video from a single real-world image or a single game screenshot without extra fine-tuning. Since it does not require scene-specific data collection, it can be used immediately in new environments.

The LingBot lineup

LingBot-World is the third model in Ant Group’s LingBot series. It is positioned as an extension of the AGI strategy from the digital domain toward physical perception.

| Model | Role |
| --- | --- |
| LingBot-Depth | Spatial understanding |
| LingBot-VLA | Vision-Language-Action model, the robot's "general brain" |
| LingBot-VA | Causal video-action model that predicts the future before acting |
| LingBot-World | World model, the model covered here |

The intended use is to generate a virtual environment with LingBot-World and produce training data in it before a robot ever moves in the real world.
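As a rough sketch of that workflow, the loop below rolls a robot policy out inside the simulated world and records transitions that could later serve as training data. Every name here (`reset`, `step`, `policy.act`) is a hypothetical interface for illustration, not part of the released code.

```python
def collect_trajectory(world_model, policy, start_image, horizon=200):
    """Roll a policy out inside the simulated world and record
    (observation, action, next_observation) transitions for training."""
    obs = world_model.reset(start_image)          # hypothetical: seed the world from a single image
    trajectory = []
    for _ in range(horizon):
        action = policy.act(obs)                  # the robot policy picks an action from what it sees
        next_obs = world_model.step(obs, action)  # the world model predicts what happens next
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    return trajectory
```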

Apple Silicon support

At the moment, it is not supported. The official setup assumes NVIDIA GPUs, and the sample command uses an 8-GPU configuration (--nproc_per_node=8).

The base Wan 2.2 model itself has been shown to run on Apple Silicon in practice, such as on an M2 Max with 64GB or an M3 Max with 36GB. But:

  • You need GGUF format, because safetensors hit MPS compatibility errors
  • You need to adjust the workflow through ComfyUI
  • It is unclear whether LingBot-World’s own extensions, such as interactivity and long-form generation, work on MPS
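If you want to see quickly which backends your own machine exposes to PyTorch, a two-line check is enough; this only reports availability and says nothing about whether LingBot-World itself will run.

```python
import torch

print("CUDA available:", torch.cuda.is_available())         # NVIDIA GPUs, what the official setup targets
print("MPS available:", torch.backends.mps.is_available())  # Apple Silicon GPU backend
```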

Realistic options

| Option | Difficulty | Notes |
| --- | --- | --- |
| Use Wan 2.2 and compromise | Low | Run GGUF in ComfyUI; an M1 Max with 64GB should be fine |
| Use cloud GPUs | Low | Rent NVIDIA GPUs from RunPod, Vast.ai, and similar providers |
| Wait for official MPS support | - | Roadmap unknown |

If you only want video generation and do not care about interactivity, running Wan 2.2 locally is more realistic. If you want to try LingBot-World’s real-time controls, you will need cloud GPUs.

Where to find it