
FDM-1: trained on 11 million hours of video, with a 50x more efficient video encoder

Ikesan

FDM-1, released by Standard Intelligence, is a general-purpose computer action foundation model trained on 11 million hours of screen recordings from the internet. They have published demos showing web service operation, CAD modeling, autonomous driving, and UI fuzzing, all at 30 fps.

FDM-1 is not an LLM, but it is AI. Both the video encoder and the action prediction model are deep neural networks trained on 11,000,000 hours of video to learn operation patterns. Unlike LLMs, however, there is no chain-of-thought reasoning or tool use; it operates entirely through video and actions. When this article says “tokens,” it does not mean BPE language tokens but rather feature representations compressed by the video encoder and discretized representations of mouse movements and key presses.

About Standard Intelligence

Standard Intelligence (@si_pbc) is a Public Benefit Corporation based in San Francisco with a team of four: Neel Redkar, Yudhister Kumar, Devansh Pandey, and Galen Mead. They previously open-sourced Hertz-Dev, an 8.5B-parameter full-duplex conversational audio model that achieved 80ms theoretical / 120ms measured latency on a single RTX 4090.

Video encoder compression efficiency

At the core of FDM-1 is video compression efficiency. It can compress 2 hours of 30fps high-resolution video into 1M tokens. That works out to 2 hours x 3,600 seconds x 30fps = 216,000 frames for 1M tokens, or roughly 4-5 tokens per frame. Existing VLMs consume hundreds to thousands of tokens per frame, making this a 50-100x efficiency gap.
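The token-budget arithmetic can be checked in a few lines (all figures are from the article; the encoder itself is undisclosed, so this is arithmetic only):

```python
# Back-of-envelope check of FDM-1's reported compression figures.
frames = 2 * 3600 * 30              # 2 hours of 30 fps video
tokens = 1_000_000                  # reported compressed size for those 2 hours
tokens_per_frame = tokens / frames
print(frames)                       # 216000
print(round(tokens_per_frame, 2))   # 4.63
# At ~4.6 tokens/frame, a 200k-token context holds ~43,000 frames,
# i.e. roughly 24 minutes of 30 fps video.
print(int(200_000 / tokens_per_frame))  # 43200
```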

Comparing the number of frames that can be processed within a 200k token context:

  • Gemini: ~775 frames
  • ChatGPT Computer Use: ~240 frames
  • Claude: ~162 frames
  • FDM-1: ~43,000 frames (about 24 minutes at 30 fps), rising to 1 hour 40 minutes of frames at max context

Even compared to Gemini, the difference is several hundred times. The ability to retain context over long UI operation sessions is fundamentally different from other approaches.

It is also important that this compression maintains sufficient fidelity to read text on screen. In computer operation, on-screen text is the primary information source; if compression destroys text readability, the model is useless. The specific encoder architecture (whether ViT-based, CNN-based, etc.) has not been disclosed.

Training pipeline

Training follows a 3-stage pipeline.

Stage 1: IDM training

An Inverse Dynamics Model (IDM) is trained on 40,000 hours of contractor-labeled screen recordings. The IDM estimates “the screen changed from A to B — what did the user do in between?” It infers mouse movements and key presses from the difference between two consecutive frames.
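The IDM's input/output contract can be sketched as a function from a pair of consecutive frames to an inferred action. Everything below is hypothetical (the toy “frames” are just cursor coordinates); the real model is a neural network over pixels, and its architecture has not been disclosed.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Hypothetical action record; the real label schema is not public."""
    kind: str        # "move", "click", "key", ...
    dx: int = 0
    dy: int = 0

def infer_action(frame_a, frame_b) -> Action:
    """Toy inverse dynamics: the screen went from A to B, so what happened?

    Here a 'frame' is just an (x, y) cursor position, which reduces the
    inference to a subtraction; a real IDM learns this mapping from pixels.
    """
    (xa, ya), (xb, yb) = frame_a, frame_b
    if (xa, ya) == (xb, yb):
        return Action("click")        # no movement between frames: assume a click
    return Action("move", dx=xb - xa, dy=yb - ya)

print(infer_action((100, 40), (130, 25)))  # Action(kind='move', dx=30, dy=-15)
```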

Stage 2: Auto-labeling

The trained IDM is then used to automatically label the remaining 11 million hours of video with action annotations. A key detail is the compute allocation for labeling: high-confidence actions (obvious clicks, etc.) are processed first, while ambiguous actions receive more computation.
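One simple realization of that allocation scheme is a two-tier cascade: label everything with a cheap pass, then re-run only low-confidence clips through a heavier pass. The function names and threshold below are hypothetical, not from Standard Intelligence.

```python
def auto_label(clips, fast_idm, slow_idm, threshold=0.9):
    """Confidence-routed auto-labeling (sketch).

    Obvious actions are accepted from the cheap model; ambiguous ones
    (confidence below threshold) get a second, more expensive pass.
    """
    labels = []
    for clip in clips:
        action, conf = fast_idm(clip)
        if conf < threshold:             # ambiguous: spend more compute
            action, conf = slow_idm(clip)
        labels.append((clip, action, conf))
    return labels

# Toy stand-ins: clips are strings; short ones are "obvious".
fast_idm = lambda clip: ("click", 0.95) if len(clip) < 6 else ("click?", 0.5)
slow_idm = lambda clip: ("drag", 0.99)

print(auto_label(["ok", "ambiguous-clip"], fast_idm, slow_idm))
# [('ok', 'click', 0.95), ('ambiguous-clip', 'drag', 0.99)]
```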

The largest existing dataset in the computer operation domain was under 20 hours. FDM-1’s 11,000,000 hours represents a difference of more than five orders of magnitude.

Stage 3: FDM training

Using the labeled video data, a Forward Dynamics Model (FDM) is trained. While the IDM reverse-engineers “what was done,” the FDM predicts “given the current screen and operation history, what should be done next.” The product name FDM-1 comes from this stage.

Masked diffusion model

Standard LLMs generate text by repeatedly predicting the next single token from left to right (autoregressive next-token prediction). FDM-1 does not use this approach. Instead, it uses masked diffusion: starting from a fully masked state and gradually unmasking positions in order of confidence. In text generation, this is known as MDLM (Masked Diffusion Language Model). FDM-1 applies it to action prediction.

Inference procedure:

  1. Input frames are interleaved with mask tokens
  2. The model predicts log-probabilities for each masked position
  3. The top-k highest-confidence predictions are selected and unmasked
  4. If masks remain, repeat
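The loop above can be sketched in a few lines. The model interface and toy predictor here are made up for illustration; the real FDM predicts action tokens from interleaved video frames.

```python
MASK = None

def masked_decode(length, model, k=2):
    """Confidence-ordered unmasking (steps 1-4 above, toy version).

    `model(tokens)` returns {position: (best_token, log_prob)} for every
    still-masked position; each round the top-k most confident predictions
    are committed, and the loop repeats until nothing is masked.
    """
    tokens = [MASK] * length                       # 1. everything masked
    while MASK in tokens:
        preds = model(tokens)                      # 2. log-probs per mask
        ranked = sorted(((lp, i, tok) for i, (tok, lp) in preds.items()),
                        reverse=True)              # 3. highest confidence first
        for _, i, tok in ranked[:k]:
            tokens[i] = tok                        #    unmask the top-k
    return tokens                                  # 4. repeat until done

# Toy predictor: position i gets token f"a{i}" with confidence -i,
# so earlier positions are unmasked first.
def toy_model(tokens):
    return {i: (f"a{i}", -float(i)) for i, t in enumerate(tokens) if t is MASK}

print(masked_decode(5, toy_model))  # ['a0', 'a1', 'a2', 'a3', 'a4']
```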

The advantage is dynamic compute allocation at inference time. Easy actions (straight cursor movements, etc.) converge in a few iterations, while ambiguous actions (scenes with multiple plausible operations) receive more iterations. The philosophy of “concentrating resources where they are needed” is similar to Mixture of Experts (MoE), but where MoE changes which expert network to consult, masked diffusion changes how long the same model thinks. Inference uses a 16-step noise schedule and non-causally references all frames in the sequence simultaneously to predict actions.

Action tokenization

Mouse movement uses exponential binning: fine bins for small movements, coarse bins for large movements, in a 49-level encoding. This is the same idea as mu-law companding in audio compression — nonlinear quantization. It supports both pixel-precise operations and large movements to screen edges. Key presses and scrolling are handled as individual tokens.
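A mu-law-style exponential binning can be sketched as follows. The 49-level figure is from the article, but the exact bin edges, movement range, and companding constant below are my guesses, not published details.

```python
import math

def mouse_delta_to_bin(delta, n_bins=49, max_delta=2048):
    """Mu-law-style exponential binning for a mouse delta (illustrative).

    Small movements land in fine bins near the center, large movements in
    coarse bins toward the edges, like mu-law companding in audio.
    """
    half = n_bins // 2                   # 24 bins per direction plus a zero bin
    if delta == 0:
        return half
    mu = 255.0                           # assumed companding constant
    x = max(-1.0, min(1.0, delta / max_delta))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return half + round(compressed * half)

print(mouse_delta_to_bin(0))      # 24 (center / zero-movement bin)
print(mouse_delta_to_bin(1))      # 25 (finest positive bin)
print(mouse_delta_to_bin(2048))   # 48 (coarsest positive bin)
print(mouse_delta_to_bin(-2048))  # 0  (coarsest negative bin)
```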

Evaluation infrastructure

Standard Intelligence has built an evaluation system capable of running over 1 million rollouts per hour across 80,000 forked virtual machines. Model inference latency from screen capture to action decision is 11 milliseconds, and a single H100 (NVIDIA’s AI GPU) can simultaneously control 42 VMs. This infrastructure is designed for running reinforcement learning loops at high speed.
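A quick sanity check on those figures (the GPU-fleet size is my inference, not a published number):

```python
import math

vms = 80_000                    # forked VMs (figure from the article)
vms_per_h100 = 42               # VMs one H100 can control simultaneously
# Inferred: GPUs needed to drive every VM at once.
print(math.ceil(vms / vms_per_h100))   # 1905

rollouts_per_hour = 1_000_000
print(rollouts_per_hour / vms)         # 12.5 rollouts per VM per hour
```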

Application demos

CAD operation

A demo shows FDM-1 operating Blender (3D modeling software) to create a gear. It uses VM snapshots to save state and roll back on failure, improving accuracy through retries.
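The snapshot-and-retry pattern is simple to sketch. The VM interface below is hypothetical; nothing about the real harness has been published beyond the snapshot/rollback behavior.

```python
class ToyVM:
    """Stand-in VM with snapshot/restore; 'state' is just an int."""
    def __init__(self):
        self.state = 0
        self.attempts = 0
    def snapshot(self):
        return self.state
    def restore(self, snap):
        self.state = snap

def run_with_snapshots(vm, task, max_retries=3):
    """Save state before acting; on failure, roll back and retry."""
    snap = vm.snapshot()
    for _ in range(max_retries):
        if task(vm):
            return True
        vm.restore(snap)          # failure: rewind the VM and try again
    return False

def flaky_task(vm):               # toy task that succeeds on the third attempt
    vm.attempts += 1
    vm.state += 1                 # mutates state; rolled back on failure
    return vm.attempts >= 3

print(run_with_snapshots(ToyVM(), flaky_task))  # True
```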

Autonomous driving

A demo of controlling a real car via a web interface, achieving block-loop driving in San Francisco streets. Fine-tuning data was under 1 hour, and a 50% accuracy improvement over the base model was reported.

UI fuzzing

Fuzzing is originally a technique for finding bugs by throwing abnormal data at a program en masse. UI fuzzing is the screen-operation version: trying operation patterns that humans would not normally perform to uncover defects.

In the banking app demo, FDM-1 detected a bug where the send button remained active immediately after a transfer completed, and rapid repeated taps drove the balance negative. A human tester would think “transfer done” and move on, but FDM-1 sees the screen state and presses any available button. Because it maintains context through continuous video, it excels at following deep chains of operations, making it well suited for discovering edge cases like this.

Benchmarks and release status

No official benchmark scores on OSWorld or similar have been published. Since the architecture is fundamentally different from screenshot-based existing Computer Use approaches, comparing them on the same benchmark may not be particularly meaningful.

There is no information on API access, pricing, or model weight releases. As of now, it remains at the demo and announcement stage.

Comparison with OpenAI VPT

The direction is the same as OpenAI’s Video PreTraining (VPT) from 2022, which was trained on Minecraft gameplay videos, but the scope differs significantly. VPT was specialized for a game environment with a maximum context of about 6 seconds, while FDM-1 targets general desktop operation with context spanning minutes to hours. Training data is not limited to games, covering CAD, finance, and general web browsing.

Thinking about gaming applications, FDM-1 would excel at games where you “look at the screen state and choose the next action.” It would pair well with command-selection games that require grinding, like Uma Musume or Gakuen Idolmaster. Since it operates purely from video, it is harder for anti-cheat systems to detect compared to conventional bots that use memory reading or API hooks. On the other hand, rhythm games requiring frame-precise timing or fighting games requiring instant reactions to opponent behavior would be difficult for a video-based learning approach.