
UI-TARS-1.5-7B: a vision AI agent that reached SOTA in GUI grounding

Ikesan

ByteDance’s Seed team has released UI-TARS-1.5, and it is interesting: a vision-language model for GUI agents whose ability to identify UI elements from screenshots is far ahead of OpenAI CUA and Claude 3.7.

Even the lightweight 7B model reaches 49.6% on ScreenSpot-Pro, roughly twice OpenAI CUA’s 23.4% and Claude 3.7’s 27.7%. A desktop app is also available, so it can run locally.

where it sits

There are two broad approaches to GUI agents:

| approach | how it works | examples |
| --- | --- | --- |
| accessibility tree | pull element information from the DOM or OS APIs | Playwright MCP, agent-browser |
| vision AI | look at screenshots and decide | Skyvern, UI-TARS |

UI-TARS is a model specialized for the vision-AI approach. Instead of taking a general-purpose VLM and hoping it works on GUI tasks, it is designed and trained for GUI operation from the start.
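The difference is easiest to see in the shape of the actions each approach produces. A minimal sketch (illustrative only; the function names are mine, not from any of these tools):

```python
# Illustrative only: the two GUI-agent approaches ground actions differently.

def tree_click(selector: str) -> dict:
    """Accessibility-tree approach: act on a semantic element reference,
    which requires DOM or OS API access."""
    return {"kind": "tree", "action": "click", "selector": selector}

def vision_click(x: int, y: int) -> dict:
    """Vision approach: act on pixel coordinates found in a screenshot,
    which works on any rendered pixels, even without an element tree."""
    return {"kind": "vision", "action": "click", "x": x, "y": y}

# A tree-based tool like Playwright MCP emits something like the first;
# a vision model like UI-TARS emits something like the second.
print(tree_click("button#submit"))
print(vision_click(100, 200))
```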

architecture

It is based on Qwen2.5-VL and then trained for GUI-agent tasks.

| item | value |
| --- | --- |
| base model | Qwen2.5-VL |
| parameters | 7B (lightweight) / 72B (largest) |
| training | reinforcement learning + inference-time scaling |

inference-time scaling

The model increases accuracy on complex tasks by spending more compute at inference time. That is the same general direction many modern agent systems are moving toward.
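The paper’s exact mechanism is not covered here, but a common form of inference-time scaling is best-of-N sampling: decode the next action several times and keep the majority answer. A toy sketch with canned samples (the sample strings are made up for illustration):

```python
from collections import Counter

# Pretend decodes of the model's next action; in a real agent these would
# come from sampling the model repeatedly with temperature > 0.
SAMPLES = ["click(100,200)", "click(105,198)", "click(100,200)",
           "click(100,200)", "click(412,88)", "click(100,200)"]

def best_of_n(samples: list[str], n: int) -> str:
    """Use the first n samples; more compute -> more reliable majority vote."""
    return Counter(samples[:n]).most_common(1)[0][0]

print(best_of_n(SAMPLES, 1))  # one sample: cheap but noisy
print(best_of_n(SAMPLES, 6))  # six samples: outliers get voted down
```

Spending more compute (larger N) trades latency for accuracy, which is exactly the knob complex GUI tasks benefit from.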

benchmarks

GUI grounding (ScreenSpot-Pro)

This benchmark measures how precisely a model can point at the right UI element on screen.

| model | score |
| --- | --- |
| UI-TARS-1.5-7B | 49.6% |
| Claude 3.7 | 27.7% |
| OpenAI CUA | 23.4% |

For a 7B model, that is a huge gap. The specialized training clearly pays off.

The largest UI-TARS-1.5 model reaches 61.6%.

computer operation (OSWorld)

This measures whether a model can actually complete tasks on a real OS.

| model | score (100 steps) |
| --- | --- |
| UI-TARS-1.5 (largest) | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| UI-TARS-1.5-7B | 27.5% |

The 7B model does not beat the biggest model or OpenAI CUA here, but it is in the same ballpark as Claude 3.7.

games

It achieved a 100% success rate on 14 Poki browser games, and an average score of 0.42 on Minecraft tasks when its explicit thinking step is enabled.

other benchmarks

| benchmark | score |
| --- | --- |
| Windows Agent Arena | 42.1% |
| Online-Mind2web | 75.8% |
| Android World | 64.2% |

how to use it

python package

```shell
pip install ui-tars
```

prompt templates

There are three templates for different use cases:

| template | use case | actions |
| --- | --- | --- |
| COMPUTER_USE | desktop (Windows, Linux, macOS) | click, drag, keyboard, scroll |
| MOBILE_USE | Android mobile | long_press, open_app, press_home, press_back |
| GROUNDING | element identification only | coordinate output |
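Choosing a template is just a platform switch. A hypothetical helper (not part of the ui-tars package) that mirrors the table above:

```python
# Hypothetical helper, not from the ui-tars package: map a target platform
# to the template name listed in the table above.
def pick_template(platform: str) -> str:
    desktop = {"windows", "linux", "macos"}
    if platform.lower() in desktop:
        return "COMPUTER_USE"   # click, drag, keyboard, scroll
    if platform.lower() == "android":
        return "MOBILE_USE"     # long_press, open_app, press_home, press_back
    return "GROUNDING"          # element identification only

print(pick_template("macos"))
```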

code example

```python
from ui_tars.action_parser import parse_action_to_structure_output

# Raw model output: a thought followed by an action string
response = "Thought: click the button\nAction: click(start_box='(100,200)')"

# Parse the action into a structured dict; the resized height/width should
# match the screenshot dimensions the model actually saw
parsed_dict = parse_action_to_structure_output(
    response,
    factor=1000,
    origin_resized_height=1080,
    origin_resized_width=1920,
    model_type="qwen25vl",
)
```

The model outputs absolute coordinates, so you need to convert them to the actual screen size.
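A sketch of that conversion, assuming the parsed coordinates are pixel positions in the resized screenshot the model saw (1920x1080 here) while the physical display is larger:

```python
def to_screen(x: float, y: float,
              resized_w: int, resized_h: int,
              screen_w: int, screen_h: int) -> tuple[int, int]:
    """Scale a point from the model's input-image space to physical pixels."""
    return round(x * screen_w / resized_w), round(y * screen_h / resized_h)

# e.g. the model saw a 1920x1080 screenshot, but the display is 2560x1440
print(to_screen(100, 200, 1920, 1080, 2560, 1440))  # -> (133, 267)
```

If I read the repo right, the same action_parser module also offers a parsing_response_to_pyautogui_code helper that emits pyautogui code directly, so you may not need to do this mapping by hand.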

desktop app

UI-TARS-desktop is also available as a GUI application.

features:

  • GUI operation in natural language
  • screenshot recognition and visual feedback
  • works on Windows, macOS, and browsers
  • local execution for privacy

Development is still active, with more than 1,100 commits.

caveats

  • coordinate handling is required because the model outputs absolute coordinates
  • the 7B model is weaker on complex tasks
  • vision AI still has limits on dynamic UIs and states that cannot be understood from screenshots alone

my take

It is clearly strong as a GUI-grounding model. The evidence suggests that a dedicated model works better than forcing a general VLM onto GUI tasks.

But real agent tasks need more than accurate pointing; they also need judgment about what to do next. OSWorld shows the 7B model still falls short on complex tasks.

Coming back to the E2E testing tools I looked at before, it would be interesting to plug UI-TARS into a vision-AI tool like Skyvern. As a fallback for cases where no accessibility tree is available, a high-accuracy GUI-grounding model has real value.