Rust async/await Runs on GPUs as VectorWare Demonstrates the First Implementation
VectorWare has announced the first implementation of Rust’s Future trait and async/await running on GPUs. It is an ambitious attempt to bring a structured programming model to GPU concurrency.
Current pain points in GPU programming
GPUs have traditionally been optimized for data parallelism. More complex programs often rely on warp specialization, where different warps run different tasks concurrently, but that approach requires manual coordination and synchronization and is easy to get wrong.
Existing frameworks such as JAX, Triton, and CUDA Tile introduce concepts like computation graphs or blocks, but they require developers to learn a new programming paradigm, which raises the adoption cost.
Why Rust async/await
Rust’s Future trait has several properties that make it appealing for GPU concurrency.
- Minimal design: the `poll` method only returns `Ready` or `Pending`, which makes it portable across very different execution environments
- Lazy evaluation: programs can be constructed as values before execution, allowing the compiler to analyze dependencies
- Ownership model: data sharing and transfer are enforced at the type level, enabling safer concurrency even on GPUs
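The first point can be made concrete with a stripped-down model of the protocol (a sketch for illustration, with names changed to avoid colliding with the standard library; the real trait additionally threads through `Pin` and a `Context` carrying a `Waker`):

```rust
// A stripped-down model of the Future protocol, for illustration only.
// The entire contract is one method returning one of two variants.
enum MiniPoll<T> {
    Ready(T),
    Pending,
}

trait MiniFuture {
    type Output;
    fn poll(&mut self) -> MiniPoll<Self::Output>;
}

// A future that needs several polls before it completes.
struct CountDown(u32);

impl MiniFuture for CountDown {
    type Output = &'static str;
    fn poll(&mut self) -> MiniPoll<&'static str> {
        if self.0 == 0 {
            MiniPoll::Ready("done")
        } else {
            self.0 -= 1;
            MiniPoll::Pending
        }
    }
}

// The simplest possible executor: poll until Ready. Nothing here assumes
// threads, interrupts, or an OS, which is what makes the protocol portable
// to environments as unusual as a GPU.
fn spin_run<F: MiniFuture>(mut fut: F) -> F::Output {
    loop {
        if let MiniPoll::Ready(v) = fut.poll() {
            return v;
        }
    }
}
```

Because the executor's only obligation is "call `poll` until it says `Ready`", the same contract can be satisfied by an OS thread pool, an embedded event loop, or a GPU warp.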
Implementation details
The first step was a simple `block_on` executor: a naive implementation that repeatedly polls a single future until it completes.
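A CPU-side equivalent of such a naive executor fits in a few lines of stable Rust. The sketch below is an illustration of the idea, not VectorWare's GPU code: a no-op waker (the loop re-polls unconditionally, so it never needs to be woken) plus a busy loop, with a hand-written `YieldOnce` future to force a second poll.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing: the executor below re-polls unconditionally,
// so wake-ups carry no information.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

/// Naive block_on: busy-poll a single future until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            // Without hardware interrupts there is nothing to sleep on,
            // so a GPU version of this loop simply spins.
            Poll::Pending => std::hint::spin_loop(),
        }
    }
}

/// A future that returns Pending exactly once, forcing a second poll.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}
```

With this in place, `block_on(async { YieldOnce(false).await; 21 * 2 })` completes after two polls and returns 42.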
After that, VectorWare adapted the Embassy executor, originally designed for embedded systems, to a GPU no_std environment. According to the post, only minimal changes were needed, which says a lot about the generality of Embassy’s design.
The demo currently supports:
- Simple async functions
- Workflows with multiple await points
- Async inside conditional branches
- Async blocks
- Integration with the `futures_util` library
- Concurrent execution of three independent tasks
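The first four items are ordinary async Rust. The sketch below (hypothetical helper names standing in for GPU-side work, driven by a trivial spin-poll loop rather than a real executor) shows a workflow with multiple await points, one of them inside a conditional branch:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical async helpers standing in for GPU-side work.
async fn load(x: u32) -> u32 { x * 10 }
async fn postprocess(x: u32) -> u32 { x + 1 }

// A workflow with multiple await points, including one inside a branch.
async fn pipeline(extra_pass: bool) -> u32 {
    let a = load(4).await;       // first await point
    if extra_pass {
        postprocess(a).await     // await inside a conditional branch
    } else {
        a
    }
}

// Trivial driver: spin-poll with a no-op waker until the future completes.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn drive<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}
```

The compiler lowers `pipeline` into a state machine with one state per await point, so supporting branches and multiple awaits falls out of the language rather than the executor.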
Current limitations
The article is also clear about the current constraints.
- Cooperative scheduling: if a future does not yield, other work can starve
- No interrupts: GPUs lack hardware interrupts, so spin loops are required
- Register pressure: maintaining scheduling state can consume registers and reduce occupancy
- Function coloring: the split between async and non-async functions, familiar from CPU Rust, carries over to GPU code
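The first limitation is the classic cooperative-scheduling hazard, and the usual mitigation is to insert explicit yield points. The sketch below (illustrative CPU code, not VectorWare's implementation) hand-writes a `yield_now` future and a toy round-robin scheduler for two tasks; a task that never awaits anything `Pending` would keep the loop stuck on it, starving its neighbor.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A hand-written yield point: Pending on the first poll (after asking to
/// be polled again), Ready on the second. Awaiting it gives the scheduler
/// a chance to run another task.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

async fn yield_now() {
    YieldNow(false).await
}

// No-op waker for a toy scheduler that re-polls unconditionally.
unsafe fn vt_clone(_: *const ()) -> RawWaker { raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
fn raw_waker() -> RawWaker {
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

/// Round-robin over two tasks on one "core": every Pending hands control
/// to the other task. Progress depends entirely on tasks yielding.
fn run_two<A: Future, B: Future>(a: A, b: B) -> (A::Output, B::Output) {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut a = pin!(a);
    let mut b = pin!(b);
    let (mut ra, mut rb) = (None, None);
    while ra.is_none() || rb.is_none() {
        if ra.is_none() {
            if let Poll::Ready(v) = a.as_mut().poll(&mut cx) { ra = Some(v); }
        }
        if rb.is_none() {
            if let Poll::Ready(v) = b.as_mut().poll(&mut cx) { rb = Some(v); }
        }
    }
    (ra.unwrap(), rb.unwrap())
}
```

Here `run_two(async { yield_now().await; 1 }, async { 2 })` lets the second task run during the first task's yield; remove the `yield_now().await` from a long-running task and the other task waits until it finishes.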
There are a lot of limitations here, but it is still interesting that the minimal design of the Future trait works on GPUs almost as-is. As AI workloads grow more complex, this feels like a meaningful experiment in rethinking GPU concurrency models.