VecLite brings Rust/WASM vector search to the browser, making in-browser RAG plausible
A post appeared on DEV Community: "I built a vector search library in Rust/WASM. Here’s what I learned about performance, browser limits, and building in public with AI."
Aakash T M built VecLite, a vector search library that runs entirely inside the browser.
The goal is privacy-first RAG.
Documents never leave the browser. Transformers.js generates embeddings on the client side, vectors are stored locally, and search happens in-browser.
No API key, no server, no external vector DB.
I recently wrote about running open-notebook as local RAG with Ollama alone and Mintlify replacing RAG with ChromaFs.
Both were about how to operate RAG. VecLite sits one layer deeper: how far can you push vector search inside the browser itself?
The gap in front of server-side vector DBs
A typical RAG pipeline splits documents into chunks, converts them to embeddings, and stores them in a search backend like Chroma, Pinecone, pgvector, or OpenSearch.
For use cases like personal notes, browser extensions, Electron apps, or offline search, a server-side vector DB is overkill.
If you have to send local documents to a server just to search them, you’ve already lost the privacy-first premise.
VecLite targets this gap.
```mermaid
flowchart TD
  A[Documents in browser] --> B[Transformers.js etc.<br/>Embedding generation]
  B --> C[VecLite<br/>Rust/WASM search]
  C --> D[IndexedDB etc.<br/>Local storage]
  C --> E[Search results<br/>fed to RAG context]
```
It’s less about replacing large vector DBs and more about building a search index that stays entirely within a single user’s device.
A very different layer from DB-integrated approaches like AliSQL’s HNSW vector search.
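To make the flow concrete, here is a minimal sketch of the indexing side, assuming Transformers.js for client-side embeddings. The VecLite calls in the trailing comment are hypothetical placeholders, not the library's documented API.

```typescript
// Minimal in-browser embedding sketch using Transformers.js.
// Everything runs client-side; no API key or server round-trip.
import { pipeline } from "@xenova/transformers";

const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embed(text: string): Promise<Float32Array> {
  // Mean-pooled, normalized sentence embedding computed in the browser.
  const output = await embedder(text, { pooling: "mean", normalize: true });
  return output.data as Float32Array;
}

// Hypothetical usage of a VecLite-like store (method names are placeholders):
// const store = await VecLite.create({ dimensions: 384 });
// await store.upsert(ids, packedVectors);            // batched, flat vectors
// const hits = await store.search(await embed(query), { k: 5 });
```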
JavaScript’s limits come from memory and GC, not the math
Cosine similarity itself is simple arithmetic.
As covered in the vectors and matrices article, embedding search is roughly about computing “directional closeness” at scale.
The problem is volume.
The original article explains that for 10,000 vectors at 1,536 dimensions, a single search requires roughly 15 million floating-point multiplications.
For 100,000 vectors, multiply by ten.
The formula is trivial, but the browser’s main thread, GC, and memory layout take a real toll.
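For a sense of where that work goes, here is what a naive flat search looks like in plain TypeScript (a sketch for illustration, not VecLite's implementation). With normalized embeddings, cosine similarity reduces to one dot product per stored vector.

```typescript
// Naive flat search over vectors packed into a single flat Float32Array.
// With normalized embeddings, cosine similarity is just a dot product.
function flatSearch(
  query: Float32Array,    // length = dim
  vectors: Float32Array,  // length = count * dim, row-major
  dim: number,
  k: number,
): { index: number; score: number }[] {
  const count = vectors.length / dim;
  const scores: { index: number; score: number }[] = [];
  for (let i = 0; i < count; i++) {
    let dot = 0;
    const base = i * dim;
    for (let j = 0; j < dim; j++) {
      dot += query[j] * vectors[base + j]; // dim multiplications per vector
    }
    scores.push({ index: i, score: dot });
  }
  // 10,000 vectors * 1,536 dims ≈ 15.4M multiply-adds before this sort.
  return scores.sort((a, b) => b.score - a.score).slice(0, k);
}
```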
VecLite pushes this work into Rust/WASM.
| Aspect | Pure JS | VecLite |
|---|---|---|
| Runtime | JavaScript | Rust/WASM + SIMD |
| Memory | JS objects / TypedArray | WASM linear memory |
| GC | Possible pauses during hot loops | No GC in search core |
| Data passing | Arrays may scatter depending on implementation | Flat Float32Array hand-off |
| Typical scale | Small-scale search | 10k–100k in-browser search |
That said, switching to Rust/WASM didn’t mean a 20x speedup.
The original benchmarks show VecLite at roughly 3.8–3.9x over pure JS for 1,536-dimension vectors.
10,000 vectors: 40 ms vs. 152 ms. 100,000 vectors: 400 ms vs. 1,576 ms.
Honestly, those are fair numbers.
V8 optimizes tight Float32Array loops pretty aggressively.
VecLite’s advantage comes from SIMD, better memory layout, and removing GC unpredictability — not from “JavaScript is slow so Rust is fast.”
Cross the WASM boundary carelessly and you lose everything
The most implementation-heavy part of the original article is the WASM boundary.
Even with Rust/WASM, calling into WASM one item at a time will eat you alive on boundary-crossing overhead.
VecLite enforces tight boundary rules:
- upsert is always batched
- Vectors are passed as flat Float32Array, not nested arrays
- Metadata is serialized to JSON before crossing
- Input validation happens on the TypeScript side so Rust never sees malformed data
The article shows that passing 10,000 items in a single WASM call takes 40 ms, while splitting into 10,000 calls degrades to 4-second territory.
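The fix is to pack everything on the JavaScript side and cross the boundary once. A hedged sketch of that pattern, where upsert_batch is a placeholder name rather than VecLite's actual export:

```typescript
// Sketch of batched hand-off across the WASM boundary.
// `upsert_batch` is a placeholder, not VecLite's actual binding.

// Slow pattern: one boundary crossing (plus copies) per item.
// for (const [id, vec] of items) wasm.upsert(id, Array.from(vec));

// Fast pattern: pack all vectors into one contiguous Float32Array,
// serialize metadata once, and make a single call into WASM.
function upsertAll(
  wasm: { upsert_batch(ids: Uint32Array, vecs: Float32Array, meta: string): void },
  items: { id: number; vector: Float32Array; meta: object }[],
  dim: number,
): void {
  const ids = new Uint32Array(items.length);
  const packed = new Float32Array(items.length * dim);
  for (let i = 0; i < items.length; i++) {
    ids[i] = items[i].id;
    packed.set(items[i].vector, i * dim); // contiguous, row-major layout
  }
  const meta = JSON.stringify(items.map((it) => it.meta)); // one serialization
  wasm.upsert_batch(ids, packed, meta);                    // one crossing
}
```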
This is a general rule for building AI components in the browser.
WASM is a fast computation box, but treat it like an RPC server with many fine-grained calls and it breaks down.
The same design principle showed up in WASM and Metal zero-copy inference.
The real speedup comes from “how not to copy data” rather than “which language to use.”
HNSW doesn’t always win with high-dimensional embeddings
The interesting part is how VecLite handles HNSW.
HNSW (Hierarchical Navigable Small World) is the go-to algorithm for approximate nearest neighbor search in vector DBs.
The common assumption is that HNSW, which navigates a graph to narrow candidates, beats flat search (brute-force distance calculation against every vector) at scale.
But in VecLite’s benchmarks, flat index was faster than HNSW at 1,536 dimensions.
Across 1,000, 5,000, and 10,000 vectors, flat won every time.
The reason comes down to scale and dimensionality in the browser context.
At 1,536 dimensions, each hop in HNSW touches a heavy vector.
The overhead of graph traversal outweighs the benefit of fewer candidate comparisons.
For low-dimensional, very large, and infrequently updated indexes, HNSW still has room to win.
But for ~100k vectors with OpenAI-class 1,536-dimension embeddings in the browser, flat search with SIMD is simply faster.
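A back-of-envelope comparison makes the trade-off visible. The HNSW constants below are illustrative assumptions, not measurements from the article:

```typescript
// Rough, purely illustrative cost model: per-query multiplication counts.
// Flat search touches every vector once; HNSW touches far fewer vectors,
// but each hop is a full high-dimensional distance plus graph bookkeeping.
const dim = 1536;
const n = 10_000;

const flatMults = n * dim; // ≈ 15.4M multiply-adds, SIMD- and cache-friendly

// Assumed HNSW traversal: roughly efSearch candidates over log2(n) layers.
const efSearch = 64;                           // illustrative setting
const hnswDistances = efSearch * Math.log2(n); // ≈ 850 distance computations
const hnswMults = hnswDistances * dim;         // ≈ 1.3M, but with random memory access

console.log({ flatMults, hnswMults });
// Fewer raw multiplications for HNSW, yet pointer-chasing across a graph of
// 1,536-dim nodes defeats SIMD and cache locality, so flat can still win.
```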
The real bottleneck might be persistence, not search
Looking at search speed alone, 400 ms for 100,000 vectors seems very usable.
But the original article and README both flag p99 spikes and persistence as the practical limits.
100,000 vectors at 1,536 dimensions in f32 take about 600 MB.
In f64, roughly 1.2 GB — extremely tight for a browser.
VecLite sticks to f32. Most embedding models effectively output f32 anyway, so this is natural.
However, the current design serializes the entire index as a JSON blob on save, which starts hurting around 50,000 vectors.
Search may be fast, but saving and restoring gets heavy.
Chunked persistence is planned.
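For illustration, a chunked layout could store fixed-size slices of the flat vector buffer as separate IndexedDB records instead of one JSON blob. The sketch below is an assumption about the general shape, not VecLite's planned implementation:

```typescript
// Sketch of chunked persistence: fixed-size slices of the flat vector buffer
// go into separate IndexedDB records, avoiding one giant JSON serialization.
const CHUNK_VECTORS = 4_096; // vectors per record, an illustrative size

function openStore(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("vec-store", 1);
    req.onupgradeneeded = () => req.result.createObjectStore("chunks");
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveChunked(vectors: Float32Array, dim: number): Promise<void> {
  const db = await openStore();
  const tx = db.transaction("chunks", "readwrite");
  const store = tx.objectStore("chunks");
  const stride = CHUNK_VECTORS * dim;
  for (let offset = 0, i = 0; offset < vectors.length; offset += stride, i++) {
    // Structured clone handles Float32Array natively: no JSON, no base64.
    store.put(vectors.slice(offset, offset + stride), i);
  }
  await new Promise<void>((resolve, reject) => {
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```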
Even with IndexedDB, any JavaScript in the same origin can read the data.
“Never sent to a server” is a strong guarantee, but “stays completely secret inside the browser” is a different claim.
Browser extensions, XSS, and other scripts on the same origin remain risks.
Building with Claude Code: CLAUDE.md as a design constraint anchor
The second half of the original article covers building VecLite with Claude Code.
The key enabler was CLAUDE.md.
The author locked down architecture, API boundaries, WASM constraints, security rules, and explicit non-goals in CLAUDE.md.
This made it harder for Claude Code to loop back to already-rejected ideas like “embed WASM as base64” or “rebuild HNSW from scratch.”
This is the exact CLAUDE.md usage described in Claude Code Best Practices.
The more you delegate implementation to AI, the more “don’t let it re-propose settled design decisions” matters, even more than task-level instructions.
On the other hand, human review was still needed for Rust SIMD intrinsics and WASM-target crate selection.
Typical misses included proposing dependencies like rayon that don’t work on wasm32-unknown-unknown, or getting SIMD horizontal sums and remainder handling wrong.
Boilerplate and test generation are easy to delegate; performance-critical boundaries stay as review targets.
Where in-browser RAG becomes practical
VecLite alone doesn’t make a RAG application.
Chunking, embedding generation, reranking, citation display, incremental updates, and access control are all separate concerns.
The embedding/reranker side is progressing too, as seen with Sentence Transformers v5.4, but running everything lightweight inside the browser remains a stretch.
Still, if the search core can handle 10k–100k vectors at realistic speed, new doors open:
- A browser extension semantic-searching your browsing history and local notes on-device
- An Electron app searching user documents without sending them to the cloud
- A PWA with offline search via Service Worker and IndexedDB
- Small internal tools running per-user RAG without standing up a server-side vector DB
References