Reading an Article on Building a Local PDF RAG with FastAPI, llama.cpp, Chroma, and Open WebUI

A post titled “Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI” showed up on DEV Community.
It ingests a PDF folder into Chroma, exposes OpenAI-compatible /v1/chat/completions via FastAPI, routes queries to a local GGUF model on llama.cpp, and lets you chat through Open WebUI.

Not a new product launch.
But as a minimal setup that takes local RAG all the way to “something you can actually use as an app,” it’s pretty clear.

I previously wrote about building an internal helpdesk RAG with Mac mini M4 Pro + Dify, which was a low-code setup centered on Dify.
This article’s setup goes the other direction: no workflow platform like Dify, just a thin API layer for RAG that you own entirely.

Using the OpenAI-Compatible API as the Boundary

The most effective aspect of the article’s architecture is that the FastAPI side serves OpenAI-compatible /v1/models and /v1/chat/completions.

flowchart LR
  PDF[PDF folder] --> Ingest[PDF loading<br/>chunk splitting]
  Ingest --> Chroma[(Chroma<br/>persistent vector DB)]
  User[Open WebUI] -->|OpenAI-compatible API| FastAPI[FastAPI RAG API]
  FastAPI -->|similarity search| Chroma
  FastAPI -->|OpenAI-compatible API| Llama[llama.cpp server]
  Llama --> FastAPI
  FastAPI --> User

Open WebUI can connect to any server that speaks the OpenAI-compatible API.
The official docs explain that when Open WebUI runs inside Docker, you use host.docker.internal instead of localhost to reach a model server on the host.

llama.cpp’s llama-server also exposes OpenAI-compatible /v1/chat/completions.
So FastAPI acts as an OpenAI-compatible server from the UI’s perspective and an OpenAI-compatible client from the LLM’s perspective.
With this boundary in place, swapping Open WebUI for LM Studio, vLLM, or an Ollama-compatible server feels like the same kind of operation.

Not Dify, Not open-notebook, Just a Thin Custom API

This setup makes sense when you want to directly control what happens inside RAG.

In the article on running open-notebook fully local on M1 Max, I ran SurrealDB, FastAPI, Next.js, Worker, and Ollama together.
PDF/URL ingestion, notebooks, citations, insight generation: it was a complete app, but the internals were substantial.

The DEV article isn’t that kind of app.
Read PDFs, chunk them, put them in Chroma, search and pass results to the LLM.
FastAPI holds just that thin pipeline.

Setup	Good for	Weak at
Dify	Internal helpdesks, workflow-heavy operations	Hard to fine-tune low-level search logic
open-notebook	When you want a finished NotebookLM-style app	Many internal components, wide verification surface
FastAPI + Chroma + llama.cpp	When you want full code-level control over the RAG API	Auth, permissions, UI, and job management are on you

So if you want to “ship something to production quickly,” Dify or open-notebook is faster.
If you want to see the search conditions, prompts, chunk updates, and model calls all in code, a thin API like this one fits.

Persistence Lives in Chroma’s Directory

The article sets ./vector_store as Chroma’s persist_directory.
When Chroma starts as a persistent client or server, it creates storage files like chroma.sqlite3 under that directory.
”Persistent” here means you can reload collections after a restart without re-ingesting PDFs.

Persistence and update management are separate things.
Just having the vector DB on disk doesn’t solve these operational issues:

Concern	Why it matters
Chunk IDs	Determines whether re-ingesting the same PDF creates duplicates or replaces
Metadata	Without page numbers, timestamps, and hashes beyond just filenames, incremental updates are difficult
Deletion	You need to decide whether to keep or remove old chunks when a PDF is deleted
Embedding model name	When you switch models, you need to know whether to regenerate existing vectors

The DEV article’s code uses /reload to delete and recreate the collection from scratch.
That’s fine for a small PDF folder, but once you pass a few thousand pages, you’ll want incremental updates.

This connects to the same problem I touched on in the article about Mintlify dropping RAG for ChromaFs: how to restore chunks as files.
RAG is simple if all you do is search. Once you start managing document lifecycles, it quickly becomes a database design problem.

Things to Fix Before Copying the Code

The original code reads best as a learning scaffold.
Before copying and running it, at least check these:

Spot	Issue
Worker function name	The definition is `_ingest_pdfs_worker`, but the thread launch side appears to reference `ingest_pdfs_worker`
Chroma integration	The `langchain_community.vectorstores.Chroma` + `persist()` pattern may need updating to `langchain_chroma` in newer environments
Streaming	Accepts `stream` but the implementation only returns non-streaming responses
Auth	Even with Open WebUI as the client, exposing on LAN calls for an API key or reverse proxy
Citations	Returns filenames but not page numbers or chunk positions, making verification weak

The worker function name in particular looks like a simple transcription error.
These “full-stack tutorial” articles are useful for picking up the architectural skeleton, but test the code in your own environment before relying on it.

Search Quality Starts Here

The article’s setup is quite bare-bones as a search system.
It embeds chunks with sentence-transformers/all-MiniLM-L6-v2, retrieves the top 4 via Chroma’s retriever, and stuffs them into the prompt.

For English-language PDFs, that works as a starting point.
For Japanese PDFs or internal docs, the real work begins here.

Change	What it improves
Switch embedding to a multilingual model like `bge-m3`	More stable semantic search in Japanese
Add hybrid search with BM25	Reduces misses on model numbers, error codes, and proper nouns
Include page numbers in metadata	Easier to verify answer sources in the PDF
Add reranking	Reduces noise in top-k results
Vary chunk size by document type	Adjustable search granularity for tables, procedures, and regulatory text

As I wrote in the Chroma Context-1 article, recent RAG is moving from single-shot vector search toward iterating queries and pruning irrelevant chunks.
This FastAPI setup doesn’t go that far, but that’s exactly what makes it useful as a minimal test bed for the stage before that.

How Far to Go Local

The article’s motivation is “read a PDF collection with a local LLM without sending data outside.”
This has become quite realistic.
Open WebUI, llama.cpp, Chroma, and FastAPI all run locally, and with a GGUF model you can try it on CPU or Apple Silicon even without a strong GPU.

Going local also means the responsibility moves to your hands.
Model quality, PDF updates and re-indexing, API exposure and Open WebUI authentication, startup order and logs and backups.
”Not sending data outside” is a strong reason, but you take on everything that cloud APIs and SaaS providers used to handle.

To start small, feed about 10 PDFs into this setup.
Then collect only the queries that failed to find the right answer, the responses that cited the wrong page, and the answers that paraphrased proper nouns.
RAG improvements go faster when you decompose them into which part to fix: embedding, chunks, metadata, or prompt.

When PDFs Aren’t Enough

Everything so far starts with a PDF folder.
But knowledge that accumulates locally isn’t just PDFs.
Terminal output saved as screenshots, Slack conversations captured, whiteboard photos.
There are cases where searching images directly is less work than converting to text first.

In the OCR-Memory article, I covered an approach that saves long agent execution history as images and searches over them with a VLM.
Compressing text into images lets visual token compression work, and search precision in limited context windows can beat text-based RAG.
That method targets agent memory, but borrowing the idea points toward RAG that treats images as first-class inputs.

With Sentence Transformers v5.4, you can pass both text and images to the same encode() and embed them in the same vector space.
Text chunks from PDFs and screenshot or diagram images go into the same Chroma collection.
A text query hits both.

A Fully Local Multimodal RAG Setup

To build a mixed PDF + image knowledge base entirely local, you add multimodal embedding and a VLM on top of the first half’s setup.

flowchart TD
    PDF[PDF folder] --> TextChunk[Text chunking]
    IMG[Image folder<br/>screenshots, diagrams, photos] --> ImgEmbed[Multimodal<br/>embedding]
    TextChunk --> TextEmbed[Multimodal<br/>embedding]
    TextEmbed --> Chroma[(Chroma<br/>unified index)]
    ImgEmbed --> Chroma
    Query[Text query] --> QEmbed[Query embedding]
    QEmbed --> Chroma
    Chroma --> TextHit[Text chunks]
    Chroma --> ImgHit[Images]
    ImgHit --> VLM[Local VLM<br/>Qwen2.5-VL 7B]
    VLM --> Desc[Image description text]
    TextHit --> LLM[llama.cpp<br/>GGUF model]
    Desc --> LLM
    LLM --> Answer[Answer]

Two things change from the first half’s setup.

The embedding model switches from all-MiniLM-L6-v2 to a multimodal one.
Using BGE-VL-base (0.1B) via Sentence Transformers v5.4, you can embed images and text into the same vectors with under 1 GB of VRAM.
For better accuracy, Qwen3-VL-Embedding-2B works, but requires around 8 GB of VRAM.

The other change is that search results can include images, so you need a path that generates description text via a VLM before passing images to the LLM.
I confirmed in the local VLM experiment that Qwen2.5-VL 7B runs stably through Ollama.
The flow is: image hit -> VLM description -> inject into LLM.
Since llama.cpp’s GGUF models can’t accept images directly, the VLM sits in between as a text converter.

Parts Available Locally

Excluding TTS and image generation, you need text understanding, image understanding, and embedding.
Pulling from the Japanese LLM comparison article and past articles, here’s what runs on M1/M4 Macs:

Role	Example model	Size	Environment
Text LLM	LLM-jp-4-32B-A3B	32B MoE (3.8B active)	Apple Silicon 16GB+
Text LLM	Qwen3.5-35B-A3B	35B MoE	Apple Silicon 32GB+
VLM	Qwen2.5-VL 7B	7B	Apple Silicon 16GB+, via Ollama
Multimodal embedding	BGE-VL-base	0.1B	CPU OK
Multimodal embedding	Qwen3-VL-Embedding-2B	2B	GPU 8GB+
Vector DB	Chroma	—	CPU, disk-persistent
Model server	llama.cpp / Ollama	—	CPU or GPU

Bundled runtimes like Foundry Local are emerging, but if you want to control multimodal embedding yourself, Sentence Transformers + Chroma + FastAPI gives more flexibility.

BGE-VL-base runs on CPU, so embedding doesn’t eat into your GPU budget.
VLM and LLM competing for the same GPU is an issue, but Ollama’s model switching handles it.
If you want both running simultaneously, an Apple Silicon machine with more memory is more practical.
On an M1 Max 64GB, running both Qwen2.5-VL 7B and LLM-jp-4-32B-A3B on Ollama with embedding on a separate Sentence Transformers process is realistic.

How to Handle Image Ingestion

PDFs have a well-established pipeline: text extraction with PyPDFLoader, chunk splitting, embedding.
Images need a bit more thought.

You can pass images directly through multimodal embedding to get vectors.
But to include images returned from Chroma in the LLM prompt, you need to convert them to text.
llama.cpp’s GGUF models don’t accept image inputs.

Two options: generate captions with a VLM at ingestion time and store them in metadata, or run the VLM on-demand when an image comes up in search results.

Generating captions at ingestion makes the ingestion process heavier but keeps search responses fast.
For large image collections, batching caption generation overnight is practical.
On-demand VLM processing keeps ingestion light but adds VLM inference on every search that hits an image.
If image hits are rare in your search patterns, this approach wastes less compute.

Storing captions in metadata means images become searchable through both text search and multimodal search.
Vector search finds similar images; BM25 finds caption text.
The hybrid search discussed in the first half works for images too.

If you only have PDFs, the first half of this article’s setup is sufficient. If screenshots and whiteboard photos are in the mix, adding the image path is worth it.
Look at what format your knowledge actually accumulates in, then decide.