PageIndex - tree RAG with LLM reasoning only, no vector search
What is this?
PageIndex is a RAG system that builds a hierarchical tree index for documents using only LLM reasoning, without vector databases or chunking. It is developed and published by VectifyAI.
Traditional RAG follows the flow “split the document -> vectorize it -> search by similarity.” PageIndex instead takes an approach closer to how a human uses a table of contents: “detect the document’s outline -> turn it into a tree structure in JSON -> generate a summary for each node.”
Its main selling point is that it avoids the fundamental weakness of vector-search RAG: high similarity does not necessarily mean relevance.
PDF processing pipeline
For PDFs, the system builds the tree index with a six-step, LLM-driven pipeline:
- Extract text from the PDF
- Detect the table of contents pages with an LLM
- Convert the table of contents into a JSON structure
- Assign page numbers to each section
- Validate and correct the assignments
- Generate summaries for each node in parallel, asynchronously
Each node in the generated tree looks like this:
```json
{
  "title": "Section name",
  "node_id": "1.2.3",
  "start_index": 10,
  "end_index": 15,
  "summary": "This section explains ...",
  "nodes": []
}
```
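As a small, hypothetical example of consuming this structure downstream (not part of the PageIndex codebase), a recursive walk can map every `node_id` to its page range:

```python
# Walk a PageIndex-style node tree and map node_id -> (start_index, end_index).
# Hypothetical helper; field names follow the node example above.

def collect_page_ranges(node, out=None):
    if out is None:
        out = {}
    out[node["node_id"]] = (node["start_index"], node["end_index"])
    for child in node.get("nodes", []):
        collect_page_ranges(child, out)
    return out

tree = {
    "title": "Section name",
    "node_id": "1.2.3",
    "start_index": 10,
    "end_index": 15,
    "summary": "This section explains ...",
    "nodes": [],
}
print(collect_page_ranges(tree))  # {'1.2.3': (10, 15)}
```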
Markdown is simpler
For Markdown, the tree is built directly from the header hierarchy (#, ##, ###, …). There is no need to detect a table of contents or assign pages, and the LLM is used only for summary generation.
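The header-based construction can be sketched with a simple stack; this is an illustration of the idea, not PageIndex's actual implementation:

```python
import re

# Build a nested tree from Markdown header levels (#, ##, ###, ...).
# Sketch only: summaries and page indices are omitted.

def build_tree(markdown_text):
    root = {"title": "(root)", "level": 0, "nodes": []}
    stack = [root]
    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if not m:
            continue  # body text does not affect the tree shape
        level, title = len(m.group(1)), m.group(2).strip()
        # Pop until the top of the stack is a shallower header.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": title, "level": level, "nodes": []}
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root

doc = "# A\n## A.1\nbody\n## A.2\n# B\n"
tree = build_tree(doc)
print([n["title"] for n in tree["nodes"]])  # ['A', 'B']
```

Each `##` section lands under the nearest preceding `#`, mirroring how the header hierarchy already encodes the document outline.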
```shell
# PDF
python3 run_pageindex.py --pdf_path tests/pdfs/earthmover.pdf

# Markdown
python3 run_pageindex.py --md_path tests/md/japanese_sample.md
```
Main options:
| Option | Default | Purpose |
|---|---|---|
| `--if-add-node-summary` | yes | Add summaries to each node |
| `--if-add-doc-description` | no | Generate a description for the whole document |
| `--if-thinning` | no | Merge low-token nodes (Markdown only) |
| `--model` | gpt-4o-2024-11-20 | Model to use |
Relationship to layout detection
This connects directly to my earlier write-up on NDLOCR’s layout detection struggles.
Back then, I used histogram analysis to detect valleys in vertical pixel density and forced the document into four columns. The goal was the same: understand which parts of a document belong together. I just did it with pixel-level layout analysis.
PageIndex solves the same problem by having the LLM understand the structure of the document. It infers section boundaries from the table of contents and header hierarchy, so it does not need image-processing tricks like histograms.
The caveat is that PageIndex assumes text input. If the OCR stage before it has already destroyed the column structure, PageIndex will not help. In other words:
Image -> [OCR (layout detection required)] -> Text -> [PageIndex] -> Tree structure
It does not eliminate layout detection. It is useful after OCR has already extracted the text correctly, when you are building the RAG layer.
Combining it with OCR pipelines
So what is interesting to pair it with?
PaddleOCR-VL-1.5 is a VLM-based OCR system that updated the state of the art for document parsing with 0.9B parameters. As I also wrote in the VLM-OCR article, VLM-OCR is strong because it can structure tables and multi-column layouts in its output.
One possible pipeline looks like this:
PDF / image
↓
PaddleOCR-VL (VLM-OCR: outputs structured tables and column layout)
↓
Structured Markdown
↓
PageIndex (build tree from headers + generate summaries)
↓
Tree-indexed RAG
VLM-OCR solves the columns and layout problem and outputs structured text, while PageIndex turns that into a tree index. That removes the need for brute-force techniques like histogram analysis.
Cost and latency
Compared with vector RAG, the downside is obvious.
After indexing, vector RAG answers a query with a single approximate-nearest-neighbor lookup, which is effectively constant time. PageIndex, by contrast, calls the LLM at each level of the tree it traverses during a query, so search cost scales with tree depth and the number of LLM calls. Building the index for a PDF also requires the multi-step chain of LLM calls described above.
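To make that cost model concrete, here is a minimal sketch of query-time traversal, assuming the node shape shown earlier. `ask_llm` is a hypothetical stand-in for a real model call, not PageIndex's API:

```python
# Sketch of LLM-guided tree search: roughly one model call per level visited.
# ask_llm is a placeholder; a real system would prompt a model with the query
# plus each child's summary and parse the chosen node_ids.

def ask_llm(query, children):
    # Stand-in relevance check; real code would call an LLM here.
    return [c for c in children if query.lower() in c["summary"].lower()]

def search(node, query, calls=0):
    children = node.get("nodes", [])
    if not children:
        return [node], calls             # leaf: a candidate answer section
    selected = ask_llm(query, children)  # one LLM call for this level
    calls += 1
    results = []
    for child in selected:
        hits, calls = search(child, query, calls)
        results.extend(hits)
    return results, calls

tree = {
    "node_id": "0", "summary": "whole document", "nodes": [
        {"node_id": "1", "summary": "covers contracts", "nodes": []},
        {"node_id": "2", "summary": "covers policies", "nodes": []},
    ],
}
hits, n_calls = search(tree, "contracts")
print([h["node_id"] for h in hits], n_calls)  # ['1'] 1
```

A vector index would answer the same query with one lookup regardless of document structure; here the call count grows with the depth and branching of the tree.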
Where it makes sense:
- When you want high-accuracy search over a small number of important documents, such as internal policies or contracts
- When vector search keeps returning results that are not really relevant
- When the document structure is clear and the table of contents or headers are reliable
If you need to search a large corpus quickly, plain vector RAG is still the better choice.
Links
Related Articles
- Solving NDLOCR column detection with histogram analysis - column detection with image processing
- PaddleOCR-VL-1.5 - VLM-based document parsing
- The rise of VLM-based OCR - DeepSeek-OCR and the potential of hybrid use - overview of VLM-OCR