
PageIndex - tree RAG with LLM reasoning only, no vector search

Ikesan

What is this?

PageIndex is a RAG system that builds a hierarchical tree index for documents using only LLM reasoning, without vector databases or chunking. It is developed and published by VectifyAI.

Traditional RAG follows the flow “split the document -> vectorize it -> search by similarity.” PageIndex instead takes an approach closer to how a human uses a table of contents: “detect the document’s outline -> turn it into a tree structure in JSON -> generate a summary for each node.”

Its main selling point is that it avoids the fundamental weakness of vector-search RAG: high similarity does not necessarily mean relevance.

PDF processing pipeline

For PDFs, the system runs a six-step, LLM-driven pipeline:

  1. Extract text from the PDF
  2. Detect the table of contents pages with an LLM
  3. Convert the table of contents into a JSON structure
  4. Assign page numbers to each section
  5. Validate and correct the assignments
  6. Generate summaries for each node in parallel, asynchronously
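Step 6 can be sketched with `asyncio`. This is an illustrative stand-in, not PageIndex's actual code: `summarize` here just echoes the title where the real system would make an LLM call, but the concurrency pattern (gathering one task per node) is the same idea.

```python
import asyncio

# Sketch of step 6: summarize every node in the tree concurrently.
# summarize() is a placeholder for a real LLM call.
async def summarize(node):
    await asyncio.sleep(0)  # stands in for network latency
    node["summary"] = f"Summary of {node['title']}"

def collect(node, out):
    """Flatten the tree into a list of nodes, depth-first."""
    out.append(node)
    for child in node.get("nodes", []):
        collect(child, out)
    return out

async def summarize_tree(root):
    nodes = collect(root, [])
    # One task per node; all summaries are generated in parallel.
    await asyncio.gather(*(summarize(n) for n in nodes))

tree = {"title": "Doc", "nodes": [{"title": "Intro", "nodes": []}]}
asyncio.run(summarize_tree(tree))
# tree["nodes"][0]["summary"] -> "Summary of Intro"
```

Because summaries are independent per node, this step parallelizes cleanly even though the earlier steps are sequential.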

Each node in the generated tree looks like this:

{
  "title": "Section name",
  "node_id": "1.2.3",
  "start_index": 10,
  "end_index": 15,
  "summary": "This section explains ...",
  "nodes": []
}
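Once you have a tree of nodes in this shape, working with it is plain recursion. A minimal sketch (the helper is mine for illustration, not part of PageIndex) that flattens the tree into `(node_id, start, end)` page ranges:

```python
# Flatten a PageIndex-style node tree into (node_id, start, end) tuples.
# The node shape follows the JSON example above.
def flatten(node, acc=None):
    if acc is None:
        acc = []
    acc.append((node["node_id"], node["start_index"], node["end_index"]))
    for child in node.get("nodes", []):
        flatten(child, acc)
    return acc

tree = {
    "title": "Section name",
    "node_id": "1.2.3",
    "start_index": 10,
    "end_index": 15,
    "summary": "This section explains ...",
    "nodes": [],
}
# flatten(tree) -> [("1.2.3", 10, 15)]
```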

Markdown is simpler

For Markdown, the tree is built directly from the header hierarchy (#, ##, ###, …). There is no need to detect a table of contents or assign pages, and the LLM is used only for summary generation.
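The header-based tree construction can be sketched in a few lines. This is my own illustrative version, not PageIndex's implementation: each heading opens a node whose depth is the number of leading `#` characters, and a stack tracks the current ancestry.

```python
# Build a section tree from Markdown ATX headers (#, ##, ###, ...).
# Illustrative sketch; PageIndex's real parser handles more cases.
def md_tree(text):
    root = {"title": "(root)", "level": 0, "nodes": []}
    stack = [root]
    for line in text.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            node = {"title": line.lstrip("# ").strip(), "level": level, "nodes": []}
            # Pop until the top of the stack is a proper ancestor.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["nodes"].append(node)
            stack.append(node)
    return root

doc = "# A\n## A.1\nbody text\n## A.2\n# B\n"
tree = md_tree(doc)
# tree["nodes"] holds "A" and "B"; "A" has children "A.1" and "A.2"
```

No table-of-contents detection and no page assignment, which is why the Markdown path only needs the LLM for summaries.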

# PDF
python3 run_pageindex.py --pdf_path tests/pdfs/earthmover.pdf

# Markdown
python3 run_pageindex.py --md_path tests/md/japanese_sample.md

Main options:

| Option | Default | Purpose |
| --- | --- | --- |
| `--if-add-node-summary` | yes | Add summaries to each node |
| `--if-add-doc-description` | no | Generate a description for the whole document |
| `--if-thinning` | no | Merge low-token nodes (Markdown only) |
| `--model` | gpt-4o-2024-11-20 | Model to use |

Relationship to layout detection

This connects directly to my earlier write-up on NDLOCR’s layout detection struggles.

Back then, I used histogram analysis to detect valleys in vertical pixel density and forced the document into four columns. The goal was the same: understand which parts of a document belong together. I just did it with pixel-level layout analysis.

PageIndex solves the same problem by having the LLM understand the structure of the document. It infers section boundaries from the table of contents and header hierarchy, so it does not need image-processing tricks like histograms.

The caveat is that PageIndex assumes text input. If the OCR stage before it has already destroyed the column structure, PageIndex will not help. In other words:

Image -> [OCR (layout detection required)] -> Text -> [PageIndex] -> Tree structure

PageIndex does not eliminate layout detection; it becomes useful at the RAG-building stage, after OCR has already extracted the text correctly.

Combining it with OCR pipelines

So what is interesting to pair it with?

PaddleOCR-VL-1.5 is a VLM-based OCR system that updated the state of the art for document parsing with 0.9B parameters. As I also wrote in the VLM-OCR article, VLM-OCR is strong because it can structure tables and multi-column layouts in its output.

One possible pipeline looks like this:

PDF / image
  ↓
PaddleOCR-VL (VLM-OCR: outputs structured tables and column layout)
  ↓
Structured Markdown
  ↓
PageIndex (build tree from headers + generate summaries)
  ↓
Tree-indexed RAG

VLM-OCR solves the columns and layout problem and outputs structured text, while PageIndex turns that into a tree index. That removes the need for brute-force techniques like histogram analysis.

Cost and latency

Compared with vector RAG, the downside is obvious.

After indexing, vector RAG answers a query with a single fast similarity lookup. PageIndex, by contrast, calls the LLM at every level as it traverses the tree during a query, so search cost scales with tree depth and the number of LLM calls. Building the index for a PDF also requires the multi-step, LLM-driven pipeline described above.
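The query-time traversal can be sketched as follows. The `select` callable stands in for the LLM call PageIndex makes at each level (this is not its actual API); here a toy keyword scorer plays that role so the sketch runs without a model.

```python
# Reasoning-based retrieval sketch: descend the tree, letting a
# selector (the LLM, in the real system) pick one child per level.
def tree_search(node, query, select):
    path = [node]
    while node.get("nodes"):
        node = select(query, node["nodes"])  # one LLM call per level
        path.append(node)
    return path

def keyword_select(query, children):
    """Toy stand-in for the LLM: pick the child whose summary
    shares the most words with the query."""
    def score(n):
        return len(set(query.lower().split()) & set(n["summary"].lower().split()))
    return max(children, key=score)

tree = {
    "title": "Doc", "summary": "whole document", "nodes": [
        {"title": "Intro", "summary": "background and motivation", "nodes": []},
        {"title": "Methods", "summary": "indexing and retrieval algorithm", "nodes": []},
    ],
}
path = tree_search(tree, "retrieval algorithm", keyword_select)
# path[-1]["title"] -> "Methods"
```

Each level of depth is one more LLM round trip, which is exactly why query latency grows with tree depth rather than staying constant.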

Where it makes sense:

  • When you want high-accuracy search over a small number of important documents, such as internal policies or contracts
  • When vector search keeps returning results that are not really relevant
  • When the document structure is clear and the table of contents or headers are reliable

If you need to search a large corpus quickly, plain vector RAG is still the better choice.