Rust async/await Runs on GPUs as VectorWare Demonstrates the First Implementation
VectorWare has announced the first implementation of Rust’s Future trait and async/await running on GPUs. It is an ambitious attempt to bring a structured programming model to GPU concurrency.
Current pain points in GPU programming
GPUs have traditionally been optimized for data parallelism. More complex programs often rely on warp specialization, where different warps run different tasks concurrently, but that approach requires manual coordination and synchronization and is easy to get wrong.
Existing frameworks such as JAX, Triton, and CUDA Tile introduce concepts like computation graphs or blocks, but they require developers to learn a new programming paradigm, which raises the adoption cost.
Why Rust async/await
Rust’s Future trait has several properties that make it appealing for GPU concurrency.
- Minimal design: the `poll` method only returns `Ready` or `Pending`, which makes it portable across very different execution environments
- Lazy evaluation: programs can be constructed as values before execution, allowing the compiler to analyze dependencies
- Ownership model: data sharing and transfer are enforced at the type level, enabling safer concurrency even on GPUs
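The first point can be made concrete with a stripped-down model of the protocol (a sketch for illustration, with names changed to avoid colliding with the standard library; the real trait additionally threads through `Pin` and a `Context` carrying a `Waker`):

```rust
// A stripped-down model of the Future protocol, for illustration only.
// The entire contract is one method returning one of two variants.
enum MiniPoll<T> {
    Ready(T),
    Pending,
}

trait MiniFuture {
    type Output;
    fn poll(&mut self) -> MiniPoll<Self::Output>;
}

// A future that needs several polls before it completes.
struct CountDown(u32);

impl MiniFuture for CountDown {
    type Output = &'static str;
    fn poll(&mut self) -> MiniPoll<&'static str> {
        if self.0 == 0 {
            MiniPoll::Ready("done")
        } else {
            self.0 -= 1;
            MiniPoll::Pending
        }
    }
}

// The simplest possible executor: poll until Ready. Nothing here assumes
// threads, interrupts, or an OS, which is what makes the protocol portable
// to environments as unusual as a GPU.
fn spin_run<F: MiniFuture>(mut fut: F) -> F::Output {
    loop {
        if let MiniPoll::Ready(v) = fut.poll() {
            return v;
        }
    }
}
```

Because the executor's only obligation is "call `poll` until it says `Ready`", the same contract can be satisfied by an OS thread pool, an embedded event loop, or a GPU warp.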
Implementation details
The first step was a simple `block_on` executor: a naive implementation that repeatedly polls a single future until it completes.
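A CPU-side equivalent of such a naive executor fits in a few lines of stable Rust. The sketch below is an illustration of the idea, not VectorWare's GPU code: a no-op waker (the loop re-polls unconditionally, so it never needs to be woken) plus a busy loop, with a hand-written `YieldOnce` future to force a second poll.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing: the executor below re-polls unconditionally,
// so wake-ups carry no information.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

/// Naive block_on: busy-poll a single future until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            // Without hardware interrupts there is nothing to sleep on,
            // so a GPU version of this loop simply spins.
            Poll::Pending => std::hint::spin_loop(),
        }
    }
}

/// A future that returns Pending exactly once, forcing a second poll.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}
```

With this in place, `block_on(async { YieldOnce(false).await; 21 * 2 })` completes after two polls and returns 42.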
After that, VectorWare adapted the Embassy executor, originally designed for embedded systems, to a GPU no_std environment. According to the post, only minimal changes were needed, which says a lot about the generality of Embassy’s design.
The demo currently supports:
- Simple async functions
- Workflows with multiple await points
- Async inside conditional branches
- Async blocks
- Integration with the `futures_util` library
- Concurrent execution of three independent tasks
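The first four items are ordinary async Rust. The sketch below (hypothetical helper names standing in for GPU-side work, driven by a trivial spin-poll loop rather than a real executor) shows a workflow with multiple await points, one of them inside a conditional branch:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical async helpers standing in for GPU-side work.
async fn load(x: u32) -> u32 { x * 10 }
async fn postprocess(x: u32) -> u32 { x + 1 }

// A workflow with multiple await points, including one inside a branch.
async fn pipeline(extra_pass: bool) -> u32 {
    let a = load(4).await;       // first await point
    if extra_pass {
        postprocess(a).await     // await inside a conditional branch
    } else {
        a
    }
}

// Trivial driver: spin-poll with a no-op waker until the future completes.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn drive<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}
```

The compiler lowers `pipeline` into a state machine with one state per await point, so supporting branches and multiple awaits falls out of the language rather than the executor.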
Current limitations
The article is also clear about the current constraints.
- Cooperative scheduling: if a future does not yield, other work can starve
- No interrupts: GPUs lack hardware interrupts, so spin loops are required
- Register pressure: maintaining scheduling state can consume registers and reduce occupancy
- Function coloring: the split between async and non-async functions, familiar from CPU Rust, carries over to GPU code
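The first limitation is the classic cooperative-scheduling hazard, and the usual mitigation is to insert explicit yield points. The sketch below (illustrative CPU code, not VectorWare's implementation) hand-writes a `yield_now` future and a toy round-robin scheduler for two tasks; a task that never awaits anything `Pending` would keep the loop stuck on it, starving its neighbor.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A hand-written yield point: Pending on the first poll (after asking to
/// be polled again), Ready on the second. Awaiting it gives the scheduler
/// a chance to run another task.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

async fn yield_now() {
    YieldNow(false).await
}

// No-op waker for a toy scheduler that re-polls unconditionally.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

/// Round-robin over two tasks on one "core": every Pending hands control
/// to the other task. Progress depends entirely on tasks yielding.
fn run_two<A: Future, B: Future>(a: A, b: B) -> (A::Output, B::Output) {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut a = pin!(a);
    let mut b = pin!(b);
    let (mut ra, mut rb) = (None, None);
    while ra.is_none() || rb.is_none() {
        if ra.is_none() {
            if let Poll::Ready(v) = a.as_mut().poll(&mut cx) { ra = Some(v); }
        }
        if rb.is_none() {
            if let Poll::Ready(v) = b.as_mut().poll(&mut cx) { rb = Some(v); }
        }
    }
    (ra.unwrap(), rb.unwrap())
}
```

Here `run_two(async { yield_now().await; 1 }, async { 2 })` lets the second task run during the first task's yield; remove the `yield_now().await` from a long-running task and the other task waits until it finishes.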
There are a lot of limitations here, but it is still interesting that the minimal design of the Future trait works on GPUs almost as-is. As AI workloads grow more complex, this feels like a meaningful experiment in rethinking GPU concurrency models.