
Holotron-12B Makes PC-Operation AI 1.7× Faster, and Unsloth Studio Lets You Tune Models Without Code

Efficiency in AI models is advancing along two fronts at once: designing models themselves to run faster, and making model tuning and operations easier. This week brought noteworthy announcements on both.

H Company’s Holotron-12B uses a new memory-efficient architecture that significantly boosts the throughput of PC-operation agents. Unsloth has released the beta of Studio, a browser-based tool that lets you fine-tune models on your own data—no code required.

Holotron-12B: Speeding up PC-operation AI by improving memory efficiency

H Company has released Holotron-12B, a PC-operations–focused model based on NVIDIA’s Nemotron that adopts a new design to reduce memory usage.

Why memory becomes a bottleneck

Standard Transformer models keep the entire conversation history in memory. As the dialog grows, the key–value cache (the model’s “memory”) grows with it.

For PC-operation agents this is especially problematic. They must retain an action history—“which button did I press earlier? what did the screen look like three steps ago?”—while also processing the current screen at high resolution. The memory footprint quickly balloons.
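To see why the cache balloons, it helps to put rough numbers on it. The sketch below estimates KV-cache size for a generic Transformer decoder; the layer count, head count, and head dimension are illustrative assumptions for a ~12B-class model, not Holotron's actual configuration.

```python
# Illustrative KV-cache size for a Transformer decoder.
# The dimensions below are assumptions, not Holotron's real config.

def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes held in the key-value cache at a given context length (fp16)."""
    # Two tensors (K and V) per layer, each shaped [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

The point is the linear growth: a 100× longer interaction needs 100× the cache, on top of the memory already consumed by high-resolution screenshots.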

Holotron-12B addresses this with a hybrid design that combines a Transformer with a State Space Model (SSM). Because an SSM compresses past information into a fixed-size state, memory usage stays constant even as interactions get longer.

| Method | Memory growth |
| --- | --- |
| Conventional (Transformer) | Increases as the interaction gets longer |
| SSM only | Constant (keeps a compressed history) |
| Hybrid (Holotron) | SSM compresses long history; Transformer handles recent, fine-grained decisions |

Using only an SSM can be coarse for very recent context like “the immediately preceding action.” Holotron splits responsibilities: the SSM stores long-term history, while the Transformer handles short-term, fine-grained decisions.
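The memory contrast can be shown with a toy sketch: a Transformer-style cache that appends something every step versus an SSM-style state of fixed size. This is purely illustrative and not Holotron's actual architecture; the moving-average update is just a stand-in for a real SSM recurrence.

```python
# Toy contrast: growing KV cache vs. fixed-size SSM state.
# Not Holotron's architecture; the update rules are placeholders.

class KVCache:
    """Appends every step's representation; memory grows linearly."""
    def __init__(self):
        self.entries = []
    def step(self, token_repr):
        self.entries.append(token_repr)
    @property
    def size(self):
        return len(self.entries)

class SSMState:
    """Folds each step into a fixed-size state; memory stays constant."""
    def __init__(self, dim=4):
        self.state = [0.0] * dim
        self.dim = dim
    def step(self, token_repr):
        # Exponential moving average as a stand-in for an SSM recurrence
        self.state = [0.9 * s + 0.1 * x for s, x in zip(self.state, token_repr)]
    @property
    def size(self):
        return self.dim

cache, ssm = KVCache(), SSMState()
for t in range(1000):
    x = [float(t % 7)] * 4
    cache.step(x)
    ssm.step(x)
print(cache.size, ssm.size)  # cache grew with every step; state stayed at dim=4
```

A hybrid keeps both: the fixed-size state absorbs the distant past, while a short sliding window of exact cache entries covers the recent steps where precision matters.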

How much did performance improve?

On WebVoyager (a benchmark in which models operate real websites), accuracy improved from 35.1% to 80.5%. The model stays on par with the previous-generation Holo2-8B in capability while delivering significantly higher throughput.

Grounding—accurately locating buttons and text fields on the screen—was also reported as much improved across several tests, though specific numbers were not disclosed.

Throughput comparison

Using the same GPU (1× H100), the results are:

| Model | Max throughput |
| --- | --- |
| Holo2-8B (previous gen) | 5,100 tokens/sec |
| Holotron-12B | 8,900 tokens/sec |

The prior generation saturated early as concurrency increased, while Holotron-12B continues to scale nearly linearly with the number of concurrent jobs. The difference shows up when running many tasks in parallel.
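The headline speedup follows directly from the table:

```python
# Speedup implied by the throughput table (1x H100, max throughput)
prev, new = 5_100, 8_900  # tokens/sec
speedup = new / prev
print(f"{speedup:.2f}x")  # ~1.75x; the headline truncates this to "1.7x"
```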

Training setup

Starting from NVIDIA’s Nemotron, H Company performed additional training on roughly 1.4 billion tokens of data it collected. The training data comprises three task types: understanding on-screen content, localizing UI elements, and manipulating the UI.

The model is released under the NVIDIA Open Model License and is available on Hugging Face. The next generation aims for further efficiency improvements and a stable enterprise-ready release.

Unsloth Studio: Tune AI models to your needs without writing code

Unsloth has beta-released Studio, an open-source tool that lets you handle fine-tuning, evaluation, and export entirely in the browser through point-and-click UI.

Unsloth itself is known as a Python library that makes fine-tuning "2× faster with 70% less memory." Traditionally, though, it required the command line or a Jupyter notebook, and therefore some programming familiarity. Studio brings the same experience to a browser UI.

Key features

It supports 500+ models across text, image understanding, and speech synthesis.

| Feature | Requirements |
| --- | --- |
| Fine-tuning | NVIDIA GPU (RTX 30/40/50 series) |
| Inference (evaluation) | CPU-only OK; Mac supported |
| Fine-tuning on Mac | In development (coming soon) |
  • Dataset auto-generation: Upload PDF/CSV/JSON/DOCX/TXT files and Studio automatically builds a training dataset. Think "turn internal documents straight into training data."
  • Model Arena: Load two models side by side and compare their answers to the same prompt.
  • Export: Write out fine-tuned models in formats supported by major tools like Ollama and LM Studio.
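Studio's auto-generation pipeline has not been published, but the general idea of turning documents into training data can be sketched in a few lines: split raw text into chunks and wrap each chunk in an instruction-style record. The chunking scheme and prompt template below are assumptions for illustration only.

```python
import json

# Minimal sketch of document-to-dataset conversion, in the spirit of
# Studio's auto-generation feature. Studio's actual pipeline is not
# disclosed; the chunking and template here are assumptions.

def text_to_records(text, chunk_chars=400):
    """Split raw text into chunks and wrap each as an instruction pair."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    records = []
    for p in paragraphs:
        for i in range(0, len(p), chunk_chars):
            records.append({
                "instruction": "Summarize the following internal document excerpt.",
                "input": p[i:i + chunk_chars],
                "output": "",  # to be filled in by a teacher model or a human
            })
    return records

doc = "First policy paragraph.\n\nSecond policy paragraph."
jsonl = "\n".join(json.dumps(r) for r in text_to_records(doc))
print(jsonl)  # one JSONL record per chunk
```

Real pipelines typically also have a model generate the `output` field; the value of a tool like Studio is doing all of this, plus the training run, without the user touching code.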

Runs entirely locally

Studio is designed for “100% offline, local operation.” Because data is not sent externally, it’s suitable for fine-tuning on sensitive internal documents.

At the moment, Mac is supported for inference only; Mac-based fine-tuning is listed as coming soon.

Previously, the workflow meant stitching together Unsloth, Jupyter, Ollama, and other tools; Studio brings those tasks into a single UI. It's still a beta, so rough edges are likely, but for anyone who wants to tailor a model to their own data without writing code, Studio looks like a good fit.