UI-TARS-1.5-7B: a vision AI agent that reached SOTA in GUI grounding
ByteDance’s Seed team released UI-TARS-1.5, and it is interesting. It is a vision-language model for GUI agents, and its ability to identify UI elements from screenshots is far ahead of OpenAI CUA and Claude 3.7.
Even the lightweight 7B model reaches 49.6% on ScreenSpotPro. That is roughly twice the score of OpenAI CUA at 23.4% and Claude 3.7 at 27.7%. A desktop app is also available, so it can run locally.
where it sits
There are two broad approaches to GUI agents:
| approach | how it works | examples |
|---|---|---|
| accessibility tree | pull element information from the DOM or OS APIs | Playwright MCP, agent-browser |
| vision AI | look at screenshots and decide | Skyvern, UI-TARS |
UI-TARS is a model specialized for the vision-AI approach. Instead of taking a general-purpose VLM and hoping it works on GUI tasks, it is designed and trained for GUI operation from the start.
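The difference is easiest to see in code. Below is a minimal sketch of the vision-AI loop: take a screenshot, ask the model what to do, parse the action out of its text response. `fake_vlm` is a stand-in for a real model call, and the action format is modeled on UI-TARS's `Action: click(start_box='(x,y)')` style; none of this is the package's actual API.

```python
import re

def fake_vlm(screenshot: bytes, instruction: str) -> str:
    # Stand-in for a real VLM call (e.g. UI-TARS served locally or via an API).
    # A real agent would send the screenshot plus the instruction to the model.
    return "Thought: the Submit button is near the bottom\nAction: click(start_box='(640,900)')"

def parse_action(response: str):
    # Pull the action name and coordinates out of the model's text response.
    match = re.search(r"Action:\s*(\w+)\(start_box='\((\d+),(\d+)\)'\)", response)
    if not match:
        return None
    return {"action": match.group(1), "x": int(match.group(2)), "y": int(match.group(3))}

def agent_step(screenshot: bytes, instruction: str):
    # One iteration of the see -> decide -> act loop.
    return parse_action(fake_vlm(screenshot, instruction))

step = agent_step(b"...png bytes...", "Click the Submit button")
```

The accessibility-tree approach would replace the screenshot and parser with structured element queries; the vision approach trades that structure for working on anything that renders pixels.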
architecture
It is based on Qwen2.5-VL and then trained for GUI-agent tasks.
| item | value |
|---|---|
| base model | Qwen2.5-VL |
| parameters | 7B lightweight / 72B largest |
| training | reinforcement learning + inference-time scaling |
inference-time scaling
The model increases accuracy on complex tasks by spending more compute at inference time. That is the same general direction many modern agent systems are moving toward.
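UI-TARS-1.5's exact mechanism is not detailed here, but one common form of inference-time scaling is best-of-n sampling: draw several candidate actions and keep the highest-scoring one. A hedged sketch, with both the proposer and the scorer stubbed out (a real system would sample from the model and score with a value model or self-evaluation):

```python
import random
import re

def propose_actions(state: str, n: int, seed: int = 0) -> list[str]:
    # Stand-in for sampling n candidate actions from the model at temperature > 0.
    rng = random.Random(seed)
    return [f"click(start_box='({rng.randint(0, 999)},{rng.randint(0, 999)})')"
            for _ in range(n)]

def score(state: str, action: str) -> float:
    # Stand-in for a learned scorer. Here: prefer candidates closer to an
    # illustrative target element at (500, 500).
    x, y = map(int, re.findall(r"\d+", action))
    return -((x - 500) ** 2 + (y - 500) ** 2)

def best_of_n(state: str, n: int = 8) -> str:
    # Spend n model calls instead of one, then keep the best candidate.
    candidates = propose_actions(state, n)
    return max(candidates, key=lambda a: score(state, a))
```

More samples cost more inference compute but raise the chance that at least one candidate is correct, which is the trade the post is describing.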
benchmarks
GUI grounding (ScreenSpotPro)
This benchmark measures how precisely a model can point at the right UI element on screen.
| model | score |
|---|---|
| UI-TARS-1.5-7B | 49.6% |
| Claude 3.7 | 27.7% |
| OpenAI CUA | 23.4% |
For a 7B model, that is a huge gap. The specialized training clearly pays off.
The largest UI-TARS-1.5 model reaches 61.6%.
computer operation (OSWorld)
This measures whether a model can actually complete tasks on a real OS.
| model | score (100 steps) |
|---|---|
| UI-TARS-1.5 largest | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| UI-TARS-1.5-7B | 27.5% |
The 7B model does not beat the biggest model or OpenAI CUA here, but it is in the same ballpark as Claude 3.7.
games
It achieved a 100% success rate on 14 Poki browser games. It also scored an average of 0.42 on Minecraft tasks when using the thinking process.
other benchmarks
| benchmark | score |
|---|---|
| Windows Agent Arena | 42.1% |
| Online-Mind2web | 75.8% |
| Android World | 64.2% |
how to use it
python package
```shell
pip install ui-tars
```
prompt templates
There are three templates for different use cases:
| template | use case | actions |
|---|---|---|
| COMPUTER_USE | desktop, including Windows, Linux, and macOS | click, drag, keyboard, scroll |
| MOBILE_USE | Android mobile | long_press, open_app, press_home, press_back |
| GROUNDING | element identification only | coordinate output |
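In practice, picking a template is just picking the prompt that matches the platform and action space. The sketch below shows the idea with illustrative placeholder strings; the real templates ship with the `ui-tars` package and are far more elaborate.

```python
# Illustrative template strings only -- not the actual prompts from the package.
TEMPLATES = {
    "COMPUTER_USE": (
        "You control a desktop. Allowed actions: click, drag, keyboard, scroll.\n"
        "Task: {task}"
    ),
    "MOBILE_USE": (
        "You control an Android device. Allowed actions include long_press, "
        "open_app, press_home, press_back.\nTask: {task}"
    ),
    "GROUNDING": (
        "Locate the element described below and output only its coordinates.\n"
        "Element: {task}"
    ),
}

def build_prompt(use_case: str, task: str) -> str:
    # Pick the template matching the use case and fill in the task description.
    return TEMPLATES[use_case].format(task=task)
```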
code example
```python
from ui_tars.action_parser import parse_action_to_structure_output

# A raw model response: a "Thought" line followed by an "Action" line.
response = "Thought: click the button\nAction: click(start_box='(100,200)')"

parsed_dict = parse_action_to_structure_output(
    response,
    factor=1000,                 # value from the package's example usage
    origin_resized_height=1080,  # height of the image the model saw
    origin_resized_width=1920,   # width of the image the model saw
    model_type="qwen25vl",
)
```
The model outputs absolute coordinates, so you need to convert them to the actual screen size.
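Assuming the parsed coordinates are in the space of the resized image the model saw, the conversion is a straight rescale. A hypothetical helper (not part of the package):

```python
def model_to_screen(x: int, y: int,
                    resized_w: int, resized_h: int,
                    screen_w: int, screen_h: int) -> tuple[int, int]:
    # Scale coordinates from the model's resized-image space to the real
    # screen resolution before dispatching a click.
    return (round(x * screen_w / resized_w),
            round(y * screen_h / resized_h))
```

For example, a click at (100, 200) in a 1920x1080 model view lands at (200, 400) on a 3840x2160 display.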
desktop app
UI-TARS-desktop is also available as a GUI application.
features:
- GUI operation in natural language
- screenshot recognition and visual feedback
- works on Windows, macOS, and browsers
- local execution for privacy
Development is still active, with more than 1,100 commits.
caveats
- coordinate handling is required because the model outputs absolute coordinates
- the 7B model is weaker on complex tasks
- vision AI still has limits on dynamic UIs and states that cannot be understood from screenshots alone
my take
It is clearly strong as a GUI-grounding model. The evidence suggests that a dedicated model works better than forcing a general VLM onto GUI tasks.
But real agent tasks need more than accurate pointing; they also need judgment about what to do next. The OSWorld numbers show the 7B model still falls short on complex multi-step tasks.
In the context of the E2E testing tools I looked at before, it would be interesting to plug UI-TARS into a vision-AI tool like Skyvern. As a fallback for when accessibility trees are unavailable, a high-accuracy GUI-grounding model has real value.