Reading an Article on Building a Local PDF RAG with FastAPI, llama.cpp, Chroma, and Open WebUI
Contents
A post titled “Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI” showed up on DEV Community.
It ingests a PDF folder into Chroma, exposes OpenAI-compatible /v1/chat/completions via FastAPI, routes queries to a local GGUF model on llama.cpp, and lets you chat through Open WebUI.
Not a new product launch.
But as a minimal setup that takes local RAG all the way to “something you can actually use as an app,” it’s pretty clear.
I previously wrote about building an internal helpdesk RAG with Mac mini M4 Pro + Dify, which was a low-code setup centered on Dify.
This article’s setup goes the other direction: no workflow platform like Dify, just a thin API layer for RAG that you own entirely.
Using the OpenAI-Compatible API as the Boundary
The most effective aspect of the article’s architecture is that the FastAPI side serves OpenAI-compatible /v1/models and /v1/chat/completions.
flowchart LR
PDF[PDF folder] --> Ingest[PDF loading<br/>chunk splitting]
Ingest --> Chroma[(Chroma<br/>persistent vector DB)]
User[Open WebUI] -->|OpenAI-compatible API| FastAPI[FastAPI RAG API]
FastAPI -->|similarity search| Chroma
FastAPI -->|OpenAI-compatible API| Llama[llama.cpp server]
Llama --> FastAPI
FastAPI --> User
Open WebUI can connect to any server that speaks the OpenAI-compatible API.
The official docs explain that when Open WebUI runs inside Docker, you use host.docker.internal instead of localhost to reach a model server on the host.
llama.cpp’s llama-server also exposes OpenAI-compatible /v1/chat/completions.
So FastAPI acts as an OpenAI-compatible server from the UI’s perspective and an OpenAI-compatible client from the LLM’s perspective.
With this boundary in place, swapping Open WebUI for LM Studio, vLLM, or an Ollama-compatible server feels like the same kind of operation.
Not Dify, Not open-notebook, Just a Thin Custom API
This setup makes sense when you want to directly control what happens inside RAG.
In the article on running open-notebook fully local on M1 Max, I ran SurrealDB, FastAPI, Next.js, Worker, and Ollama together.
PDF/URL ingestion, notebooks, citations, insight generation: it was a complete app, but the internals were substantial.
The DEV article isn’t that kind of app.
Read PDFs, chunk them, put them in Chroma, search and pass results to the LLM.
FastAPI holds just that thin pipeline.
| Setup | Good for | Weak at |
|---|---|---|
| Dify | Internal helpdesks, workflow-heavy operations | Hard to fine-tune low-level search logic |
| open-notebook | When you want a finished NotebookLM-style app | Many internal components, wide verification surface |
| FastAPI + Chroma + llama.cpp | When you want full code-level control over the RAG API | Auth, permissions, UI, and job management are on you |
So if you want to “ship something to production quickly,” Dify or open-notebook is faster.
If you want to see the search conditions, prompts, chunk updates, and model calls all in code, a thin API like this one fits.
Persistence Lives in Chroma’s Directory
The article sets ./vector_store as Chroma’s persist_directory.
When Chroma starts as a persistent client or server, it creates storage files like chroma.sqlite3 under that directory.
”Persistent” here means you can reload collections after a restart without re-ingesting PDFs.
Persistence and update management are separate things.
Just having the vector DB on disk doesn’t solve these operational issues:
| Concern | Why it matters |
|---|---|
| Chunk IDs | Determines whether re-ingesting the same PDF creates duplicates or replaces |
| Metadata | Without page numbers, timestamps, and hashes beyond just filenames, incremental updates are difficult |
| Deletion | You need to decide whether to keep or remove old chunks when a PDF is deleted |
| Embedding model name | When you switch models, you need to know whether to regenerate existing vectors |
The DEV article’s code uses /reload to delete and recreate the collection from scratch.
That’s fine for a small PDF folder, but once you pass a few thousand pages, you’ll want incremental updates.
This connects to the same problem I touched on in the article about Mintlify dropping RAG for ChromaFs: how to restore chunks as files.
RAG is simple if all you do is search. Once you start managing document lifecycles, it quickly becomes a database design problem.
Things to Fix Before Copying the Code
The original code reads best as a learning scaffold.
Before copying and running it, at least check these:
| Spot | Issue |
|---|---|
| Worker function name | The definition is _ingest_pdfs_worker, but the thread launch side appears to reference ingest_pdfs_worker |
| Chroma integration | The langchain_community.vectorstores.Chroma + persist() pattern may need updating to langchain_chroma in newer environments |
| Streaming | Accepts stream but the implementation only returns non-streaming responses |
| Auth | Even with Open WebUI as the client, exposing on LAN calls for an API key or reverse proxy |
| Citations | Returns filenames but not page numbers or chunk positions, making verification weak |
The worker function name in particular looks like a simple transcription error.
These “full-stack tutorial” articles are useful for picking up the architectural skeleton, but test the code in your own environment before relying on it.
Search Quality Starts Here
The article’s setup is quite bare-bones as a search system.
It embeds chunks with sentence-transformers/all-MiniLM-L6-v2, retrieves the top 4 via Chroma’s retriever, and stuffs them into the prompt.
For English-language PDFs, that works as a starting point.
For Japanese PDFs or internal docs, the real work begins here.
| Change | What it improves |
|---|---|
Switch embedding to a multilingual model like bge-m3 | More stable semantic search in Japanese |
| Add hybrid search with BM25 | Reduces misses on model numbers, error codes, and proper nouns |
| Include page numbers in metadata | Easier to verify answer sources in the PDF |
| Add reranking | Reduces noise in top-k results |
| Vary chunk size by document type | Adjustable search granularity for tables, procedures, and regulatory text |
As I wrote in the Chroma Context-1 article, recent RAG is moving from single-shot vector search toward iterating queries and pruning irrelevant chunks.
This FastAPI setup doesn’t go that far, but that’s exactly what makes it useful as a minimal test bed for the stage before that.
How Far to Go Local
The article’s motivation is “read a PDF collection with a local LLM without sending data outside.”
This has become quite realistic.
Open WebUI, llama.cpp, Chroma, and FastAPI all run locally, and with a GGUF model you can try it on CPU or Apple Silicon even without a strong GPU.
Going local also means the responsibility moves to your hands.
Model quality, PDF updates and re-indexing, API exposure and Open WebUI authentication, startup order and logs and backups.
”Not sending data outside” is a strong reason, but you take on everything that cloud APIs and SaaS providers used to handle.
To start small, feed about 10 PDFs into this setup.
Then collect only the queries that failed to find the right answer, the responses that cited the wrong page, and the answers that paraphrased proper nouns.
RAG improvements go faster when you decompose them into which part to fix: embedding, chunks, metadata, or prompt.
When PDFs Aren’t Enough
Everything so far starts with a PDF folder.
But knowledge that accumulates locally isn’t just PDFs.
Terminal output saved as screenshots, Slack conversations captured, whiteboard photos.
There are cases where searching images directly is less work than converting to text first.
In the OCR-Memory article, I covered an approach that saves long agent execution history as images and searches over them with a VLM.
Compressing text into images lets visual token compression work, and search precision in limited context windows can beat text-based RAG.
That method targets agent memory, but borrowing the idea points toward RAG that treats images as first-class inputs.
With Sentence Transformers v5.4, you can pass both text and images to the same encode() and embed them in the same vector space.
Text chunks from PDFs and screenshot or diagram images go into the same Chroma collection.
A text query hits both.
A Fully Local Multimodal RAG Setup
To build a mixed PDF + image knowledge base entirely local, you add multimodal embedding and a VLM on top of the first half’s setup.
flowchart TD
PDF[PDF folder] --> TextChunk[Text chunking]
IMG[Image folder<br/>screenshots, diagrams, photos] --> ImgEmbed[Multimodal<br/>embedding]
TextChunk --> TextEmbed[Multimodal<br/>embedding]
TextEmbed --> Chroma[(Chroma<br/>unified index)]
ImgEmbed --> Chroma
Query[Text query] --> QEmbed[Query embedding]
QEmbed --> Chroma
Chroma --> TextHit[Text chunks]
Chroma --> ImgHit[Images]
ImgHit --> VLM[Local VLM<br/>Qwen2.5-VL 7B]
VLM --> Desc[Image description text]
TextHit --> LLM[llama.cpp<br/>GGUF model]
Desc --> LLM
LLM --> Answer[Answer]
Two things change from the first half’s setup.
The embedding model switches from all-MiniLM-L6-v2 to a multimodal one.
Using BGE-VL-base (0.1B) via Sentence Transformers v5.4, you can embed images and text into the same vectors with under 1 GB of VRAM.
For better accuracy, Qwen3-VL-Embedding-2B works, but requires around 8 GB of VRAM.
The other change is that search results can include images, so you need a path that generates description text via a VLM before passing images to the LLM.
I confirmed in the local VLM experiment that Qwen2.5-VL 7B runs stably through Ollama.
The flow is: image hit -> VLM description -> inject into LLM.
Since llama.cpp’s GGUF models can’t accept images directly, the VLM sits in between as a text converter.
Parts Available Locally
Excluding TTS and image generation, you need text understanding, image understanding, and embedding.
Pulling from the Japanese LLM comparison article and past articles, here’s what runs on M1/M4 Macs:
| Role | Example model | Size | Environment |
|---|---|---|---|
| Text LLM | LLM-jp-4-32B-A3B | 32B MoE (3.8B active) | Apple Silicon 16GB+ |
| Text LLM | Qwen3.5-35B-A3B | 35B MoE | Apple Silicon 32GB+ |
| VLM | Qwen2.5-VL 7B | 7B | Apple Silicon 16GB+, via Ollama |
| Multimodal embedding | BGE-VL-base | 0.1B | CPU OK |
| Multimodal embedding | Qwen3-VL-Embedding-2B | 2B | GPU 8GB+ |
| Vector DB | Chroma | — | CPU, disk-persistent |
| Model server | llama.cpp / Ollama | — | CPU or GPU |
Bundled runtimes like Foundry Local are emerging, but if you want to control multimodal embedding yourself, Sentence Transformers + Chroma + FastAPI gives more flexibility.
BGE-VL-base runs on CPU, so embedding doesn’t eat into your GPU budget.
VLM and LLM competing for the same GPU is an issue, but Ollama’s model switching handles it.
If you want both running simultaneously, an Apple Silicon machine with more memory is more practical.
On an M1 Max 64GB, running both Qwen2.5-VL 7B and LLM-jp-4-32B-A3B on Ollama with embedding on a separate Sentence Transformers process is realistic.
How to Handle Image Ingestion
PDFs have a well-established pipeline: text extraction with PyPDFLoader, chunk splitting, embedding.
Images need a bit more thought.
You can pass images directly through multimodal embedding to get vectors.
But to include images returned from Chroma in the LLM prompt, you need to convert them to text.
llama.cpp’s GGUF models don’t accept image inputs.
Two options: generate captions with a VLM at ingestion time and store them in metadata, or run the VLM on-demand when an image comes up in search results.
Generating captions at ingestion makes the ingestion process heavier but keeps search responses fast.
For large image collections, batching caption generation overnight is practical.
On-demand VLM processing keeps ingestion light but adds VLM inference on every search that hits an image.
If image hits are rare in your search patterns, this approach wastes less compute.
Storing captions in metadata means images become searchable through both text search and multimodal search.
Vector search finds similar images; BM25 finds caption text.
The hybrid search discussed in the first half works for images too.
If you only have PDFs, the first half of this article’s setup is sufficient. If screenshots and whiteboard photos are in the mix, adding the image path is worth it.
Look at what format your knowledge actually accumulates in, then decide.