Mintlify ditched RAG and switched to a virtual file system
RAG has become a standard approach. You split documents into chunks, store them in a vector database, and pull chunks similar to the user’s query into the LLM context. In practice, people keep finding ways to improve precision, such as Chroma’s search agent Context-1 and PageIndex’s tree RAG.
Mintlify, however, abandoned the RAG paradigm itself. Instead, it built ChromaFs, a virtual file system. You give an AI agent a bash shell and let it explore documentation with grep and cat. Every file-system access is translated into a ChromaDB query.
This article revisits the basics of RAG and digs into the problem ChromaFs actually solves.
What RAG Actually Is
RAG, or Retrieval-Augmented Generation, is a method for supplementing an LLM’s knowledge with external data in real time. In 2020, Meta, then Facebook AI Research, proposed it in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”
An LLM only knows what it saw during training. If you ask it about internal documents or the latest product manual, it can hallucinate plausible-sounding nonsense. RAG solves that problem with a simple idea: retrieve documents related to the question and attach them to the prompt.
graph TD
A["User question"] --> B["Vectorize with an embedding model"]
B --> C["Similarity search in the vector DB"]
C --> D["Fetch top-k chunks"]
D --> E["Send question + chunks to the LLM"]
E --> F["Generate answer"]
Broken down, the flow looks like this:
- Split documents into chunks of a few hundred tokens each
- Vectorize each chunk with an embedding model such as OpenAI’s text-embedding-3-small
- Store the vectors in a vector DB such as Chroma, Pinecone, or Weaviate
- Vectorize the user’s question and retrieve nearby chunks with cosine similarity
- Include the retrieved chunks in the prompt and send them to the LLM
The reason this “search and pass along” mechanism was so powerful is that it could reflect the latest information without fine-tuning the LLM. When documentation changes, you only need to update the vector DB.
Embeddings and Cosine Similarity
The core of RAG is the embedding. It converts text into a numeric vector with hundreds or thousands of dimensions, and texts with similar meanings end up close together.
For example, “How to sort a list in Python” and “How do I reorder an array in Python?” are different strings, but their embedding vectors are very close. By contrast, “How to sort a list in Python” and “Pythons are snakes native to Africa and Asia” may look similar as strings, but their vectors are far apart.
Cosine similarity measures how close two vectors are. It calculates the cosine of the angle between them. The closer to 1, the more similar; the closer to 0, the less related; and the closer to -1, the more opposite the meaning.
similarity = cos(θ) = (A · B) / (|A| × |B|)
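The formula above can be sketched as a small helper. The two-dimensional vectors in the usage lines are toy stand-ins for real embeddings, which have hundreds or thousands of dimensions:

```typescript
// Cosine similarity between two vectors:
// similarity = (A · B) / (|A| × |B|)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Same direction → 1, orthogonal → 0, opposite → -1.
cosineSimilarity([1, 0], [2, 0]); // 1
cosineSimilarity([1, 0], [0, 1]); // 0
cosineSimilarity([1, 0], [-1, 0]); // -1
```

Note that magnitude cancels out: `[1, 0]` and `[2, 0]` point the same way, so their similarity is exactly 1.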
RAG relies on the assumption that “the chunk closest to the question vector probably contains the information needed to answer the question.” That assumption works well in many cases, but not always. That is the structural limit discussed later.
Vector DBs, KVSs, and RDBMSs
Vector DBs show up naturally in RAG discussions, but why not use an existing database? Vector DBs, KVSs, and RDBMSs are fundamentally optimized for different kinds of queries.
| Property | RDBMS | KVS | Vector DB |
|---|---|---|---|
| Data model | Tables (rows and columns) | Key-value pairs | Vectors + metadata |
| Main query style | SQL (filters, joins) | Exact key lookup | Nearest-neighbor similarity search |
| Query example | WHERE category = 'API' | GET doc:12345 | “chunks similar to ‘authentication setup’” |
| Indexes | B-tree, hash | Hash table | ANN structures such as HNSW and IVF |
| Scalability | Vertical-first | Horizontal-friendly | Horizontally scalable |
| Typical implementations | PostgreSQL, MySQL | Redis, DynamoDB | Chroma, Pinecone, Weaviate |
| Best fit | Structured data and complex queries | Sessions and cache | Semantic similarity search |
Why RDBMSs Alone Are Not Enough for RAG
RDBMSs can do full-text search too, for example with LIKE '%authentication%' or tsvector. But that is keyword matching, not semantic similarity. Searching for “authentication setup” will not necessarily surface a page titled “initial login procedure.”
PostgreSQL has pgvector, an extension that adds vector columns and cosine-similarity indexing. That makes vector search possible inside an RDBMS, but it can still lag dedicated vector DBs in index efficiency and scalability. For small to medium projects, adding pgvector to an existing PostgreSQL instance is often the simplest choice.
Vector DB Index Structures: HNSW
Vector DBs can do fast approximate nearest-neighbor search because they use specialized index structures. The most widely adopted one is HNSW (Hierarchical Navigable Small World), which Chroma also uses by default.
graph TD
A["Search query vector"] --> B["Layer 2 (sparse graph)<br/>Locate the rough area with only a few nodes"]
B --> C["Layer 1 (middle graph)<br/>Narrow the range"]
C --> D["Layer 0 (dense graph)<br/>Search nearby nodes precisely"]
D --> E["Return the top-k nearest neighbors"]
HNSW is a multilayer graph. Upper layers have fewer nodes and long-range edges; lower layers are denser and have shorter edges. At query time, the search starts at the top layer, follows nodes close to the query vector, and descends layer by layer. The closest nodes in the bottom layer become the result set.
That structure avoids computing distances to every vector, so approximate nearest neighbors can be found quickly. Even with a million vectors, search can complete in milliseconds. The tradeoff is that it is approximate, so there is no guarantee that the exact nearest neighbor will be returned.
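The core move at each layer is a greedy walk: hop to whichever neighbor is closest to the query, and stop at a local minimum. The sketch below shows that walk on a single toy graph; real HNSW repeats it from the sparse top layer down to layer 0, and the graph and vectors here are made up for illustration:

```typescript
// node id -> neighbor ids (one layer of an HNSW-style graph)
type Graph = Map<number, number[]>;

// Euclidean distance between two vectors.
function dist(a: number[], b: number[]): number {
  return Math.hypot(...a.map((x, i) => x - b[i]));
}

// Greedy walk: keep moving to the neighbor closest to the query
// until no neighbor improves on the current node.
function greedySearch(
  graph: Graph,
  vectors: number[][],
  query: number[],
  entry: number
): number {
  let current = entry;
  for (;;) {
    let best = current;
    for (const n of graph.get(current) ?? []) {
      if (dist(vectors[n], query) < dist(vectors[best], query)) best = n;
    }
    if (best === current) return current; // local minimum: stop
    current = best;
  }
}

// A chain 0-1-2-3 along the x-axis; a query near node 3 walks there.
const vectors = [[0, 0], [1, 0], [2, 0], [3, 0]];
const layer: Graph = new Map([
  [0, [1]],
  [1, [0, 2]],
  [2, [1, 3]],
  [3, [2]],
]);
greedySearch(layer, vectors, [2.9, 0], 0); // 3
```

The "approximate" caveat is visible here: the walk stops at the first local minimum, so on an unlucky graph it can miss the true nearest neighbor.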
What KVS Is Used For
Redis also appears in ChromaFs’ architecture. The KVS is good at fast key-based reads and writes, and at caching. After the vector DB narrows down the candidate files, the actual chunk data is fetched from Redis. In other words, the vector DB handles search and the KVS handles retrieval.
The Structural Limits of RAG
Now we get to Mintlify’s problem. Mintlify is a documentation platform used by many companies, including Discord, Vercel, and Cursor. It powers an AI assistant that handles 850,000 conversations per month, but conventional RAG has structural limits.
The Weakness of Single-Pass Search
The standard RAG pipeline is single-pass: retrieve relevant chunks once and stop. That breaks in a few common cases.
| Case | Problem |
|---|---|
| The answer spans multiple pages | Vector search returns chunks similar to a single query. For comparative questions like “What is the difference between A and B?”, you may need both A and B, but only get one side |
| Exact values matter | Exact API signatures and configuration parameters are hard to capture with vector similarity. A query like “default value of the third argument of createUser” might return a semantically similar but wrong API description |
| Surrounding context matters | Chunking can destroy page context. If a chunk only makes sense after reading the previous section, the LLM may still misread it |
These are long-known structural weaknesses of vector-search RAG. Many improvements, such as multi-hop retrieval and reranking, have been proposed, but they add pipeline complexity.
Mintlify’s Sandbox Approach
Mintlify’s earlier approach used a sandbox: it launched a container for each conversation, mounted the documents, and let the agent explore freely with bash. Accuracy was high, but the P90 boot time was about 46 seconds and the cost per conversation was $0.0137. At 850,000 conversations per month, that works out to more than $70,000 in annual infrastructure cost.
graph TD
A["Conversation request"] --> B["Start container<br/>(P90: about 46 seconds)"]
B --> C["Mount documents into the file system"]
C --> D["Let the agent explore freely with bash"]
D --> E["Generate answer"]
E --> F["Destroy the container"]
The accuracy was genuinely good. The agent could inspect directory structure, read files, and use grep to search across docs, just like a human. The problem was cost and latency.
ChromaFs Architecture
ChromaFs is designed to balance sandbox-level accuracy with RAG-level cost efficiency. It removes the need to start containers while still giving the agent the ability to explore with bash.
ChromaFs is built on just-bash, developed by Vercel Labs. just-bash is a TypeScript reimplementation of bash. It rewrites UNIX commands like grep, cat, ls, find, and cd from scratch so they work in a browser environment. It is not a wrapper around a real shell. The parser and command execution are both written in TypeScript.
just-bash provides a pluggable file system interface called IFileSystem. ChromaFs implements that interface and translates file-system operations into ChromaDB queries.
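To make the plug-in point concrete, here is a deliberately simplified, synchronous sketch of what such a file-system interface could look like, with an in-memory implementation standing in for ChromaFs. The method names and shapes are assumptions for illustration, not just-bash's actual IFileSystem API:

```typescript
// Hypothetical minimal file-system interface in the spirit of
// just-bash's pluggable IFileSystem (real API differs).
interface IFileSystem {
  readFile(path: string): string;
  readDir(dir: string): string[];
  exists(path: string): boolean;
  writeFile(path: string, data: string): void;
}

// In-memory backend; ChromaFs would answer the same calls with
// ChromaDB metadata queries and a Redis chunk cache instead.
class InMemoryFs implements IFileSystem {
  constructor(private files: Map<string, string>) {}

  readFile(path: string): string {
    const data = this.files.get(path);
    if (data === undefined) throw new Error("ENOENT: " + path);
    return data;
  }

  // List the immediate children of a directory, deduplicated.
  readDir(dir: string): string[] {
    const prefix = dir.endsWith("/") ? dir : dir + "/";
    return [...this.files.keys()]
      .filter((p) => p.startsWith(prefix))
      .map((p) => p.slice(prefix.length).split("/")[0])
      .filter((v, i, arr) => arr.indexOf(v) === i);
  }

  exists(path: string): boolean {
    return this.files.has(path);
  }

  // Mirrors ChromaFs: all writes fail with EROFS.
  writeFile(_path: string, _data: string): void {
    throw new Error("EROFS: read-only file system");
  }
}
```

Swapping the backend is then just a matter of handing the shell a different IFileSystem implementation; the agent's `ls`, `cat`, and `grep` calls never change.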
graph TD
A["Agent"] --> B["bash commands<br/>ls, cat, grep, etc."]
B --> C["just-bash<br/>TypeScript bash emulator"]
C --> D["IFileSystem interface"]
D --> E["ChromaFs<br/>Virtual file system"]
E --> F["Chroma DB"]
E --> G["Redis<br/>Chunk cache"]
The agent thinks it is issuing normal bash commands, but underneath, it is really querying a vector database.
Initializing the Directory Tree
The file structure for the entire documentation set is stored in the Chroma collection as gzip-compressed JSON (__path_tree__). At startup, ChromaFs expands it into two in-memory structures.
| Data structure | Contents | Purpose |
|---|---|---|
| Set<string> | Set of all file paths | Check whether a path exists (O(1)) |
| Map<string, string[]> | Mapping from directories to child entries | Respond to ls |
On the second and later sessions, the cached tree is used, so there are zero network calls. Because gzip compression keeps the tree small, even a documentation set with thousands of pages can be expanded in a few milliseconds.
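Expanding a flat path list into those two structures is straightforward. The sketch below uses a toy path list standing in for the decompressed __path_tree__ payload:

```typescript
// Build the two in-memory structures from a flat list of file paths:
// a Set for O(1) existence checks and a Map from directory to children.
function buildTree(paths: string[]) {
  const fileSet = new Set<string>(paths);
  const dirMap = new Map<string, string[]>();
  for (const path of paths) {
    const parts = path.split("/").filter(Boolean);
    let dir = "/";
    for (const part of parts) {
      const children = dirMap.get(dir) ?? [];
      if (!children.includes(part)) children.push(part);
      dirMap.set(dir, children);
      dir = dir === "/" ? "/" + part : dir + "/" + part;
    }
  }
  return { fileSet, dirMap };
}

const { fileSet, dirMap } = buildTree([
  "/docs/auth/setup.md",
  "/docs/auth/tokens.md",
  "/docs/intro.md",
]);
fileSet.has("/docs/intro.md"); // true  → backs the exists check
dirMap.get("/docs/auth"); // ["setup.md", "tokens.md"] → backs ls
```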
Reconstructing Pages
Documents are split into chunks when stored in Chroma. When cat reads a page, ChromaFs fetches all chunks matching the page slug and concatenates them in chunk_index order.
The key point is that the vector DB acts as the backend of a file system. In ordinary RAG, chunks are retrieved by vector similarity. In ChromaFs’ cat, chunks are fetched by a metadata query (WHERE slug = 'target-page'). No vectors are used at all. The goal is to reconstruct the file contents exactly, so similarity search is unnecessary.
Repeated access to the same page is absorbed by the cache. By keeping prefetched chunks in Redis, the second and later cat commands can answer without touching the DB.
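The reconstruction step itself is a filter, a sort, and a join. The chunk shape below is an assumption modeled on the article's description (a slug plus a chunk_index), not Chroma's actual record schema:

```typescript
// A stored document chunk, as described in the article.
interface Chunk {
  slug: string;
  chunk_index: number;
  text: string;
}

// Rebuild a page exactly: select by metadata (no vectors involved),
// order by chunk_index, and concatenate.
function catPage(chunks: Chunk[], slug: string): string {
  return chunks
    .filter((c) => c.slug === slug) // WHERE slug = '...'
    .sort((a, b) => a.chunk_index - b.chunk_index)
    .map((c) => c.text)
    .join("");
}

const store: Chunk[] = [
  { slug: "auth", chunk_index: 1, text: "world" },
  { slug: "auth", chunk_index: 0, text: "hello " },
  { slug: "other", chunk_index: 0, text: "nope" },
];
catPage(store, "auth"); // "hello world"
```

In the real system the filtered chunks would come from Redis when cached and from Chroma otherwise; the assembly logic is the same either way.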
Two-Stage grep
Recursive search with grep -r would otherwise scan the entire documentation set. ChromaFs handles it in two stages.
graph TD
A["grep -r 'pattern' /docs"] --> B["Stage 1: Coarse filter<br/>Use a Chroma metadata query to narrow the candidate files"]
B --> C["Candidate files"]
C --> D["Stage 2: Fine filter<br/>Run regex matching on chunks in Redis"]
D --> E["Search results<br/>Line number + matching line"]
Stage 1 uses a Chroma metadata query. Rather than vector similarity, it checks whether the chunk contents contain the search pattern, and that greatly reduces the candidate file set.
Stage 2 performs in-memory regex matching on the chunks already prefetched into Redis. After reconstructing the actual file contents, it returns search results with line numbers.
By splitting the work this way, recursive search across large docs sets, even thousands of pages, can still finish in milliseconds. Unlike a linear file-system scan, the index-based narrowing in stage 1 does the heavy lifting.
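The two stages can be sketched against an in-memory map of reconstructed files. The cheap `includes` check below stands in for the Chroma metadata query of stage 1; only the survivors get the full regex pass with line numbers:

```typescript
interface Match {
  file: string;
  line: number;
  text: string;
}

// Two-stage grep: coarse containment filter, then exact regex
// matching with line numbers on the candidates.
function grepDocs(
  files: Map<string, string>, // path -> reconstructed contents
  pattern: RegExp,
  coarse: string // literal used for the stage-1 filter
): Match[] {
  // Stage 1: narrow candidate files with a cheap containment check
  // (standing in for the Chroma metadata query).
  const candidates = [...files].filter(([, body]) => body.includes(coarse));

  // Stage 2: regex matching on the candidates, with line numbers.
  const matches: Match[] = [];
  for (const [file, body] of candidates) {
    body.split("\n").forEach((text, i) => {
      if (pattern.test(text)) matches.push({ file, line: i + 1, text });
    });
  }
  return matches;
}

const docs = new Map([
  ["/docs/auth.md", "first line\nauthentication setup\nlast line"],
  ["/docs/intro.md", "nothing relevant here"],
]);
grepDocs(docs, /authentication/, "authentication");
// → one match: /docs/auth.md, line 2
```

Stage 1 is what keeps this fast: if only a handful of files survive the coarse filter, the expensive per-line regex work never touches the rest of the corpus.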
Write Operations Are Forbidden
All write operations return an EROFS (Read-Only File System) error. This is intentional.
- State does not leak across sessions, so one agent cannot break another session
- The source of truth for the documentation is the data inside Chroma DB, and changing it through the virtual file system could break consistency
- Read-only behavior makes cache invalidation much simpler
Performance Comparison
The difference from the sandbox approach is dramatic.
| Metric | Sandbox | ChromaFs |
|---|---|---|
| P90 boot time | About 46 seconds | About 100 milliseconds |
| Marginal cost per conversation | $0.0137 | $0 (reuse existing DB) |
| Search method | Linear disk scan | DB metadata query |
| Annual infrastructure cost at 850,000 conversations/month | More than $70,000 | Effectively $0 |
Boot time became 460 times faster, and marginal cost fell to zero. That matters at a scale of nearly 30,000 conversations per day.
The reason the cost becomes zero is that ChromaFs reuses an existing Chroma DB. The sandbox approach created a new container and file system for each conversation, but ChromaFs has every session query the same Chroma collection. There is no additional infrastructure cost.
Access Control
Documentation platforms need different users to see different pages. In ChromaFs, each path-tree entry carries isPublic and groups fields.
graph TD
A["Session starts"] --> B["Read user permissions from the session token"]
B --> C["Exclude unauthorized paths while building the tree"]
C --> D["File system visible to the agent"]
D --> E["ls: only authorized files"]
D --> F["grep: only authorized chunks"]
D --> G["cat: unauthorized files return ENOENT"]
When building the tree, ChromaFs uses the session token to prune paths the user cannot access, and then applies the same filter to later Chroma queries.
An unauthorized file is treated as “does not exist” rather than “access denied.” It does not show up in ls, and grep does not find it. In the sandbox approach, the file system had to be mounted again for each conversation, but ChromaFs filters at the query layer, so container startup is unnecessary.
This “it does not exist” approach also makes security sense. Returning access denied would reveal that the file exists. Hiding existence itself minimizes the risk of information leakage.
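The tree-pruning step can be sketched as a filter over path-tree entries. The isPublic and groups field names follow the article; the rest of the entry shape is assumed for illustration:

```typescript
// One entry in the path tree, carrying its access-control metadata.
interface TreeEntry {
  path: string;
  isPublic: boolean;
  groups: string[]; // groups allowed to see this path
}

// Keep only the paths this session's groups may see. Everything else
// is dropped before the tree is built, so to the agent those files
// simply do not exist (ENOENT, invisible to ls and grep).
function visiblePaths(entries: TreeEntry[], userGroups: string[]): string[] {
  return entries
    .filter((e) => e.isPublic || e.groups.some((g) => userGroups.includes(g)))
    .map((e) => e.path);
}

const entries: TreeEntry[] = [
  { path: "/docs/getting-started.md", isPublic: true, groups: [] },
  { path: "/docs/enterprise/sso.md", isPublic: false, groups: ["enterprise"] },
];
visiblePaths(entries, ["free-tier"]); // ["/docs/getting-started.md"]
```

Because the same filter is applied to subsequent Chroma queries, a path pruned here can never resurface through grep either.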
just-bash Design Principles
Vercel Labs’ just-bash is the foundation of ChromaFs, and it is a collection of interesting design decisions on its own.
Unlike a normal shell such as bash or zsh, just-bash has the following properties:
| Property | Normal bash | just-bash |
|---|---|---|
| Runtime | Process on the OS | TypeScript runtime |
| Shell state | Preserved across commands | Reset on every exec() |
| File system | Provided by the OS | Pluggable (IFileSystem) |
| Network | Allowed by default | Disabled by default |
| Execution model | fork + exec | Function calls |
Resetting state on every exec() is clearly designed with AI agents in mind. Agent tool calls are independent, and it is safer if they do not rely on side effects from previous commands such as environment changes or cd.
Disabling network access by default is also a crucial sandbox property. Only explicitly whitelisted endpoints with URL prefixes can be accessed. That reduces the risk of prompt injection causing the agent to make unintended external calls.
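A URL-prefix allowlist of this kind boils down to a prefix check before any request goes out. The allowed prefix below is a made-up example, not a just-bash default:

```typescript
// Hypothetical outbound allowlist: only explicitly whitelisted URL
// prefixes may be fetched; everything else is blocked by default.
const allowedPrefixes = ["https://api.example.com/"];

function isAllowed(url: string): boolean {
  return allowedPrefixes.some((prefix) => url.startsWith(prefix));
}

isAllowed("https://api.example.com/v1/search"); // true
isAllowed("https://attacker.example.net/exfil"); // false
```

Deny-by-default matters here: even if a prompt injection convinces the agent to call out, the request dies at this check unless the operator explicitly whitelisted the destination.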
The biggest advantage of just-bash for Mintlify is its pluggable file system. By swapping the IFileSystem implementation, Mintlify can point the file system the agent sees at Chroma DB. Because just-bash still exposes the @vercel/sandbox-compatible API, the migration cost from the existing sandbox-based code is low.
The Difference Between RAG and Tool Use
The essence of ChromaFs is the difference between “RAG” and “tool use.”
graph LR
subgraph "RAG paradigm"
R1["Query"] --> R2["Search"]
R2 --> R3["Retrieve chunks"]
R3 --> R4["Pass to LLM"]
R4 --> R5["Answer"]
end
subgraph "Tool-use paradigm"
T1["Question"] --> T2["Use ls to understand structure"]
T2 --> T3["Use cat to read"]
T3 --> T4["Use grep across files"]
T4 --> T5["Additional cat"]
T5 --> T6["Answer"]
end
In RAG, you have to decide what to search for when you issue the query. It is a single pass, so if the result is insufficient, there is no real chance to recover. There are agentic multi-hop RAG systems too, but they mostly just loop through search, retrieval, and search again. The freedom to explore is still limited.
In the tool-use paradigm, the agent builds its own exploration strategy. It uses ls to inspect the directory tree, cat to read files and deepen its understanding, and then uses that evidence to decide the next grep. That is exactly how a human would look for documentation.
You can also phrase the difference as “search” versus “exploration.”
| Item | RAG (search) | ChromaFs (exploration) |
|---|---|---|
| Strategy flexibility | Fixed at query time | Can change mid-flight |
| Referencing multiple pages | Depends on chunk limit | Freely read multiple files |
| Structural awareness | None, just flat chunks | Directory structure is visible |
| Exact value retrieval | Depends on vector similarity | Exact-match search with grep |
| Number of LLM calls | One | Multiple, depending on tool calls |
| Latency | Low, one pass | Slightly higher, multiple turns |