UI-TARS-1.5-7B: a vision AI agent that reached SOTA in GUI grounding
ByteDance’s Seed team released UI-TARS-1.5, and it is interesting. It is a vision-language model for GUI agents, and its ability to identify UI elements from screenshots is far ahead of OpenAI CUA and Claude 3.7.
Even the lightweight 7B model reaches 49.6% on ScreenSpotPro. That is roughly twice the score of OpenAI CUA at 23.4% and Claude 3.7 at 27.7%. A desktop app is also available, so it can run locally.
where it sits
There are two broad approaches to GUI agents:
| approach | how it works | examples |
|---|---|---|
| accessibility tree | pull element information from the DOM or OS APIs | Playwright MCP, agent-browser |
| vision AI | look at screenshots and decide | Skyvern, UI-TARS |
UI-TARS is a model specialized for the vision-AI approach. Instead of taking a general-purpose VLM and hoping it works on GUI tasks, it is designed and trained for GUI operation from the start.
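The difference is easiest to see in code. Below is a minimal sketch of the vision-AI loop: take a screenshot, ask the model what to do, parse the action out of its text response. `fake_vlm` is a stand-in for a real model call, and the action format is modeled on UI-TARS's `Action: click(start_box='(x,y)')` style; none of this is the package's actual API.

```python
import re

def fake_vlm(screenshot: bytes, instruction: str) -> str:
    # Stand-in for a real VLM call (e.g. UI-TARS served locally or via an API).
    # A real agent would send the screenshot plus the instruction to the model.
    return "Thought: the Submit button is near the bottom\nAction: click(start_box='(640,900)')"

def parse_action(response: str):
    # Pull the action name and coordinates out of the model's text response.
    match = re.search(r"Action:\s*(\w+)\(start_box='\((\d+),(\d+)\)'\)", response)
    if not match:
        return None
    return {"action": match.group(1), "x": int(match.group(2)), "y": int(match.group(3))}

def agent_step(screenshot: bytes, instruction: str):
    # One iteration of the see -> decide -> act loop.
    return parse_action(fake_vlm(screenshot, instruction))

step = agent_step(b"...png bytes...", "Click the Submit button")
```

The accessibility-tree approach would replace the screenshot and parser with structured element queries; the vision approach trades that structure for working on anything that renders pixels.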
architecture
It is based on Qwen2.5-VL and then trained for GUI-agent tasks.
| item | value |
|---|---|
| base model | Qwen2.5-VL |
| parameters | 7B lightweight / 72B largest |
| training | reinforcement learning + inference-time scaling |
inference-time scaling
The model increases accuracy on complex tasks by spending more compute at inference time. That is the same general direction many modern agent systems are moving toward.
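UI-TARS-1.5's exact mechanism is not detailed here, but one common form of inference-time scaling is best-of-n sampling: draw several candidate actions and keep the highest-scoring one. A hedged sketch, with both the proposer and the scorer stubbed out (a real system would sample from the model and score with a value model or self-evaluation):

```python
import random
import re

def propose_actions(state: str, n: int, seed: int = 0) -> list[str]:
    # Stand-in for sampling n candidate actions from the model at temperature > 0.
    rng = random.Random(seed)
    return [f"click(start_box='({rng.randint(0, 999)},{rng.randint(0, 999)})')"
            for _ in range(n)]

def score(state: str, action: str) -> float:
    # Stand-in for a learned scorer. Here: prefer candidates closer to an
    # illustrative target element at (500, 500).
    x, y = map(int, re.findall(r"\d+", action))
    return -((x - 500) ** 2 + (y - 500) ** 2)

def best_of_n(state: str, n: int = 8) -> str:
    # Spend n model calls instead of one, then keep the best candidate.
    candidates = propose_actions(state, n)
    return max(candidates, key=lambda a: score(state, a))
```

More samples cost more inference compute but raise the chance that at least one candidate is correct, which is the trade the post is describing.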
benchmarks
GUI grounding (ScreenSpotPro)
This benchmark measures how precisely a model can point at the right UI element on screen.
| model | score |
|---|---|
| UI-TARS-1.5-7B | 49.6% |
| Claude 3.7 | 27.7% |
| OpenAI CUA | 23.4% |
For a 7B model, that is a huge gap. The specialized training clearly pays off.
The largest UI-TARS-1.5 model reaches 61.6%.
computer operation (OSWorld)
This measures whether a model can actually complete tasks on a real OS.
| model | score (100 steps) |
|---|---|
| UI-TARS-1.5 largest | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| UI-TARS-1.5-7B | 27.5% |
The 7B model does not beat the biggest model or OpenAI CUA here, but it is in the same ballpark as Claude 3.7.
games
It achieved a 100% success rate on 14 Poki browser games. It also scored an average of 0.42 on Minecraft tasks when using the thinking process.
other benchmarks
| benchmark | score |
|---|---|
| Windows Agent Arena | 42.1% |
| Online-Mind2web | 75.8% |
| Android World | 64.2% |
how to use it
python package
```shell
pip install ui-tars
```
prompt templates
There are three templates for different use cases:
| template | use case | actions |
|---|---|---|
| COMPUTER_USE | desktop, including Windows, Linux, and macOS | click, drag, keyboard, scroll |
| MOBILE_USE | Android mobile | long_press, open_app, press_home, press_back |
| GROUNDING | element identification only | coordinate output |
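In practice, picking a template is just picking the prompt that matches the platform and action space. The sketch below shows the idea with illustrative placeholder strings; the real templates ship with the `ui-tars` package and are far more elaborate.

```python
# Illustrative template strings only -- not the actual prompts from the package.
TEMPLATES = {
    "COMPUTER_USE": (
        "You control a desktop. Allowed actions: click, drag, keyboard, scroll.\n"
        "Task: {task}"
    ),
    "MOBILE_USE": (
        "You control an Android device. Allowed actions include long_press, "
        "open_app, press_home, press_back.\nTask: {task}"
    ),
    "GROUNDING": (
        "Locate the element described below and output only its coordinates.\n"
        "Element: {task}"
    ),
}

def build_prompt(use_case: str, task: str) -> str:
    # Pick the template matching the use case and fill in the task description.
    return TEMPLATES[use_case].format(task=task)
```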
code example
```python
from ui_tars.action_parser import parse_action_to_structure_output

# A raw model response: a "Thought" line followed by an "Action" line.
response = "Thought: click the button\nAction: click(start_box='(100,200)')"

parsed_dict = parse_action_to_structure_output(
    response,
    factor=1000,                 # value from the package's example usage
    origin_resized_height=1080,  # height of the image the model saw
    origin_resized_width=1920,   # width of the image the model saw
    model_type="qwen25vl",
)
```
The model outputs absolute coordinates, so you need to convert them to the actual screen size.
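Assuming the parsed coordinates are in the space of the resized image the model saw, the conversion is a straight rescale. A hypothetical helper (not part of the package):

```python
def model_to_screen(x: int, y: int,
                    resized_w: int, resized_h: int,
                    screen_w: int, screen_h: int) -> tuple[int, int]:
    # Scale coordinates from the model's resized-image space to the real
    # screen resolution before dispatching a click.
    return (round(x * screen_w / resized_w),
            round(y * screen_h / resized_h))
```

For example, a click at (100, 200) in a 1920x1080 model view lands at (200, 400) on a 3840x2160 display.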
desktop app
UI-TARS-desktop is also available as a GUI application.
features:
- GUI operation in natural language
- screenshot recognition and visual feedback
- works on Windows, macOS, and browsers
- local execution for privacy
Development is still active, with more than 1,100 commits.
caveats
- coordinate handling is required because the model outputs absolute coordinates
- the 7B model is weaker on complex tasks
- vision AI still has limits on dynamic UIs and states that cannot be understood from screenshots alone
my take
It is clearly strong as a GUI-grounding model. The evidence suggests that a dedicated model works better than forcing a general VLM onto GUI tasks.
But real agent tasks need more than accurate pointing; they also need judgment about what to do next. The OSWorld numbers show the 7B model still falls short on complex multi-step tasks.
In the context of the E2E testing tools I looked at before, it would be interesting to plug UI-TARS into a vision-AI tool like Skyvern. As a fallback for when accessibility trees are unavailable, a high-accuracy GUI-grounding model has real value.