
Zero-copy GPU inference on Apple Silicon with WebAssembly and Metal

Ikesan

Abacus Noir’s blog published the design of an inference runtime called Driftwood.
It walks through a zero-copy inference chain in which a WebAssembly module and the Apple Silicon GPU read and write the same physical memory. By connecting mmap, Metal’s bytesNoCopy buffer, and a Wasmtime custom allocator, it demonstrates at the measurement level that “the CPU and the GPU read and write identical physical bytes.”

It sounds unremarkable, but this is the basic building block for treating Wasm as a control plane and the GPU as a compute plane on Apple Silicon, so let me walk through how it works.

UMA and discrete GPUs give “zero-copy” different meanings

With a discrete GPU (a PCIe-attached external GPU), host memory and GPU memory are physically separate, so every inference step incurs a transfer across the PCIe bus.
Every matrix or embedding vector handed to the GPU triggers a copy, which costs both latency and memory.

Apple Silicon uses UMA (Unified Memory Architecture), where the CPU and GPU share the same DRAM chips.
In theory no copy should be needed, but if you call Metal’s makeBuffer(length:options:) normally, Metal allocates its own managed memory region internally.
The result: you still end up copying from your host-side buffer into the GPU-side buffer.

Driftwood exploits this: it forces the Wasm runtime’s linear memory and the Metal buffer the GPU touches to match at the physical address level.
When the pointer is the same, the overhead of passing arguments from a host-function call into the GPU disappears.

This sits on the same trajectory of Apple Silicon-specific optimization as earlier coverage, such as Ollama’s MLX backend migration and SwiftLM’s Metal-native implementation.
The difference is that a Wasm layer is inserted as a portable control plane.

The implementation consists of three links.

flowchart TD
    A[Wasm module<br/>linear memory] -->|same pointer| B[mmap region<br/>16KB-aligned pages]
    B -->|same pointer| C[MTLBuffer<br/>bytesNoCopy wrap]
    C -->|GPU kernel run| D[Metal GPU<br/>reads/writes same physical bytes]
    E[Wasmtime<br/>MemoryCreator trait] -->|swaps linear memory| A
    F[LLM kernel<br/>GEMM etc.] -->|compute pipeline| D

On ARM64 macOS, mmap(MAP_ANON | MAP_PRIVATE) returns a 16KB-aligned address.
That 16KB is ARM64’s page size, and POSIX guarantees mmap returns a page-aligned pointer.
This isn’t accidental alignment — the spec requires it.

Why does alignment matter? Because Metal’s newBufferWithBytesNoCopy, used in the next step, only accepts page-aligned pointers.
Memory from malloc lands on arbitrary byte boundaries, so it can’t be passed to that function.
You must use vm_allocate or mmap.

MTLDevice.makeBuffer(bytesNoCopy:length:options:deallocator:) is an API that wraps an existing host pointer as an MTLBuffer without copying.
Metal does not allocate a separate region internally; the GPU accesses the physical memory pointed to by the pointer you pass.

Driftwood verifies that the pointer returned by MTLBuffer.contents() matches the original pointer from mmap.
This is the measurement used to show that “Metal is not secretly copying behind the wrap.”
The implementation is a simple == comparison, and you can cross-check it by measuring the RSS (Resident Set Size — the process’s actual physical memory footprint) delta.

As a measured number: wrapping a 16MB buffer with bytesNoCopy increases RSS by 0.03MB (noise level), whereas the copy path adds 16.78MB.
With the copy path you pay double the physical memory and need an explicit transfer every time you want host-side writes visible on the GPU.
The numbers show a clear gap in favor of the zero-copy path.

Wasmtime has a trait (Rust-speak for an interface) called MemoryCreator that lets the runtime substitute the WebAssembly linear memory — the same memory a module references through memory.data_ptr().
If you implement the MemoryCreator/LinearMemory pair, Wasmtime stops calling mmap internally and uses the memory you hand it.

If you pass the earlier mmap region as the backing for the linear memory, any byte the Wasm module writes is immediately visible to Metal.
The pointer returned by memory.data_ptr() is the same address you originally got from mmap.
Wasm, CPU, and GPU all point to the same physical bytes.

There is a catch, though: the MemoryCreator takes a guard_size_in_bytes parameter that specifies how many bytes of guard page (memory that SEGVs on access) sit at the tail of the region.
Wasmtime’s JIT code assumes this guard size and omits bounds checks accordingly, so if you hand it a region without any guard area, the JIT can walk into illegal memory.
The standard approach is to mmap body + guard at allocation time and mprotect the guard portion to PROT_NONE.

Measured performance

Here are the numbers Abacus Noir publishes.

| Metric | Value |
| --- | --- |
| GEMM (128x128) latency | ~6.75ms |
| Pointer identity | mmap == MTLBuffer.contents() |
| RSS delta for 16MB allocation | 0.03MB (zero-copy path) |
| Correctness (16,384 elements) | 0 errors |

There are model-inference numbers too.
Running Llama 3.2 1B on an M1 MacBook Pro: 229ms to load, 106ms to prefill 5 tokens, ~9ms per generated token.
The overhead of a host function call (the path a Wasm module uses to launch a Metal kernel) is described as “negligible.”

Running a 1B-class model at 9ms/token under Wasm control matches the level of a native Swift implementation like SwiftLM.
It shows that inserting Wasm into the stack carries effectively no performance penalty.

safetensors-based KV cache

On top of the zero-copy foundation, Driftwood adds actor snapshots — freezing and restoring conversation state.
The key design choice is serializing the KV cache using the safetensors format: serializing 24 tokens (1.58MB) takes 1.1ms, restoring takes 1.4ms.

The noteworthy number is “5.45× faster than re-prefilling.”
Instead of re-computing every token from scratch to restore a conversation, you read the KV cache back in.
Post-restoration token generation was bit-identical in 10/10 runs.
That’s the precision you need to make long-context agents resume instantly, or to move state across machines.

safetensors is the de facto tensor serialization format in the ML world and avoids the arbitrary code execution risk of pickle.
Using the same format for runtime state like the KV cache is a sensible choice.

What you actually gain

If this design holds, there are wins on the operational, performance, and distribution axes.

On portability: Wasm is a sandboxed execution environment, easy to distribute across machines and OSes.
Write the inference logic in Wasm, and you can standardize distribution and execution across Apple Silicon machines.

On overhead: normally, stitching together Wasm/JIT/GPU introduces costs at the serialization boundaries.
UMA plus zero-copy erases those boundaries, so the cost of inserting Wasm becomes essentially zero.

On state management: making the KV cache snapshotable via safetensors lets you pause, resume, or migrate a conversational agent across machines.
That’s directly useful for running agent-style applications.

On discrete GPUs (NVIDIA H100, AMD MI300, etc.) there are equivalents of bytesNoCopy, but physical memory really is separate, so “reading and writing the same bytes” isn’t possible there.
This is a UMA-specific design and doesn’t port as-is to other platforms.

Gotchas in the implementation

Here are the points the original post and related documentation call out.

malloc-family allocations can’t be passed to bytesNoCopy.
Apple’s official forum explicitly says you need vm_allocate or mmap.
malloc implementations return arbitrary alignments, so you won’t land on page boundaries.

Don’t skip the Wasmtime guard page setup.
Without the guard, JIT code can silently do out-of-bounds accesses.
The standard pattern is to over-allocate with mmap and mprotect the tail.

The ty argument to MemoryCreator takes minimum/maximum in Wasm page units (64KB).
Don’t confuse that with the ARM64 OS page size (16KB) — there’s a 4× difference.

Don’t forget deallocator responsibility.
bytesNoCopy takes a deallocator closure, so the design is to have munmap called when the MTLBuffer is released.
Both sides hold ownership that crosses over, so you need to be explicit about who releases what.

Where this sits in Apple Silicon optimization

The zero-copy chain itself is low-profile, but you can place it in context by looking at the optimization of the AI inference stack.
Ollama 0.19's migration to an MLX backend leaned into Apple Silicon's UMA; Flash-MoE ran 397B parameters on a MacBook by combining Metal shaders with SSD expert streaming; and SwiftLM natively integrated TurboQuant, SSD streaming, and Metal. Each step continued the trend of peeling off overhead layer by layer.

Driftwood extends this trajectory by showing that “slotting Wasm in as a control plane doesn’t hurt performance.”
It’s a design that gets both portability and performance — normally a tradeoff — using UMA as the underlying hardware property to land both.

On a different axis from Apple MPS support arriving through Python-ecosystem PRs like transformers-to-mlx, this is an approach that rebuilds the inference runtime from the systems-programming level.
Both should eventually converge on exploiting Apple Silicon’s physical characteristics.

The Driftwood runtime itself is still under development and isn’t open source yet, but the design document released now reads as a concrete step-by-step guide for anyone wanting to build a similar system.
It’s a solid starting point for putting together an inference system with Wasmtime and Metal.

Does this work for non-LLM workloads too?

The demo uses Llama 3.2 1B, but the three-link chain doesn’t depend on anything LLM-specific.
All you need is the basic GPGPU pattern: “the GPU kernel reads and writes a host-allocated buffer and returns the result to the host.”
If an LLM-style GEMM (matrix multiply) kernel works, the same setup applies to the following workloads in principle.

For image/video inference, candidates are CNN-based object detection (YOLO, DETR variants) and image classification.
The pattern — passing input and output tensors zero-copy — is the same, so any model architecture benefits.
UNet-based diffusion models like Stable Diffusion fit the same mold.

For OCR and document analysis, the 0.9B-class VLMs such as PaddleOCR-VL-1.5 and GLM-OCR, used in NDLOCR-Lite-style layout-detection + recognition pipelines, are Transformer-based internally, so they go through almost the same call path as LLMs.
Chaining detection models (CNNs like DEIMv2) with recognition models (PARSeq/VLM) is mostly buffer passing, which is exactly where zero-copy helps most.

Audio models like Whisper and voice-cloning-style TTS (LuxTTS for example) have the same structure — passing mel-spectrogram tensors to the GPU.

For embeddings and rerankers, short-sequence encoders like Sentence Transformers v5.4 multimodal embeddings have lighter per-inference compute, so the ratio of host↔GPU round-trip cost is higher, and the zero-copy benefit is proportionally larger.

For non-generative GPGPU work like matrix computation, physics simulation, or image filters — anything that isn’t even machine learning — the same mechanism applies.
Metal kernels aren’t tied to LLMs, so if you can write a compute shader, you can run it.

In short, “anything that needs to hand a large byte array from a Wasm module to the GPU” is in scope, and LLMs are just the loudest example.
The zero-copy win is biggest when input/output tensors are in the MB-to-GB range; for tiny buffers, copy cost is small enough that the gap is hard to see.

Connection to the WASM + OCR history

This blog has a record of stumbling while trying to run WASM OCR in the browser.
The reason @paddlejs-models/ocr didn't run in the browser turned out to be a runtime-compatibility issue: the Emscripten-built WASM referenced Node.js-only globals.
In OCR Web 2025, the distribution-side concern of not wanting to ship a large dictionary or WASM binary kept us on Tesseract.js.

Driftwood here is about Wasmtime running natively on macOS/Apple Silicon, not a browser, so distribution constraints and the Module-global issue don’t directly apply.
Still, the problem statement — “run a Wasm-hosted model fast on the real machine’s GPU” — sits on the same line as the older browser WASM OCR work.
On the browser side, WebGPU + WASM zero-copy is the same theme, and the spec-side discussion around sharing MLBuffer or WebGPU GPUBuffer with the Wasm linear memory is moving forward.
Rebuilding on-device OCR like NDLOCR-Lite iOS on top of a Wasm control plane + GPU is a plausible future configuration.

The positioning is clearer when you compare to past articles.

| Article | Topic | Runtime | GPU integration |
| --- | --- | --- | --- |
| @paddlejs-models/ocr won't run | Browser OCR | Browser WASM | WebGL path, failed |
| OCR Web 2025 | Browser OCR in production | Browser WASM | Tesseract.js only |
| NDLOCR-Lite iOS | On-device OCR | ONNX Runtime Mobile | CoreML/Metal |
| Flash-MoE | Huge-model inference | Native | Metal direct |
| SwiftLM | LLM inference | Native Swift | Metal direct |
| This one (Driftwood) | LLM inference | Wasmtime | Metal + zero-copy |

The axis is whether Wasm is inserted or not, and on performance the Wasm-based design is no longer visibly behind native.
That makes “keep distribution ease and don’t sacrifice GPU performance” a realistic position.

Non-LLM tasks benefit more

My personal interest leans toward non-generative inference — OCR, embeddings, search — which I think benefits more from this kind of structure.
Generative workloads produce output tokens one at a time, so the GPU↔CPU round trip is small relative to per-step compute.
In contrast, OCR, embeddings, and detection are throughput-oriented — batches go in, results come out — and the number of round trips dominates performance.

If an OCR engine distributable inside a Wasm sandbox can hit Apple Silicon’s GPU at native-ish speed, the distribution options for OCR-style services widen.
Today everyone ends up on Python + PyTorch/MLX or Swift + Core ML + Metal, but there’s now room for Wasm to sit in between.
Waiting for Driftwood’s OSS release aside, the design itself is already a useful reference if you want to build something similar yourself.


Original article: Zero-Copy GPU Inference from WebAssembly on Apple Silicon
