PaddleOCR-VL-1.5 - document parsing SOTA with only 0.9B parameters
On January 29, 2026, the PaddlePaddle team at Baidu released PaddleOCR-VL-1.5. This lightweight vision-language model with only 0.9B parameters achieved 94.5% accuracy on the document-parsing benchmark OmniDocBench v1.5, outperforming large models such as GPT-4o and Qwen2.5-VL-72B.
What PaddleOCR-VL Is
PaddleOCR is an open-source OCR toolkit developed by Baidu that supports more than 100 languages. PaddleOCR-VL is its successor, which uses a vision-language model (VLM) architecture for document parsing.
The architecture consists of two components:
- Visual encoder: a NaViT-style dynamic-resolution ViT that flexibly handles varying input image sizes
- Language model: ERNIE-4.5-0.3B
With a total of only 0.9B parameters, the model is compact enough to run on a single NVIDIA A100 GPU.
Main Improvements in v1.5
Irregular shape localization
It is the first model in the industry to support polygon detection boxes. It can accurately detect regions in tilted, folded, or curved documents, making it strong on real scanned documents where simple rectangular boxes fail.
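To see why axis-aligned rectangles break down on tilted pages, consider a text line rotated by 30 degrees: its axis-aligned bounding box covers several times the area of the text itself, pulling in neighboring content. A minimal geometric sketch (my own illustration, not the model's actual output format):

```python
import math

def shoelace_area(poly):
    """Area of a polygon [(x, y), ...] via the shoelace formula."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2

def axis_aligned_bbox_area(poly):
    """Area of the smallest axis-aligned rectangle enclosing the polygon."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def rotated_rect(w, h, angle_deg):
    """Corners of a w x h rectangle rotated by angle_deg around the origin."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a)
            for x, y in [(0, 0), (w, 0), (w, h), (0, h)]]

# A 200 x 20 px text line tilted by 30 degrees:
poly = rotated_rect(200, 20, 30)
print(f"{shoelace_area(poly):.1f}")           # text-line area: 4000.0
print(f"{axis_aligned_bbox_area(poly):.1f}")  # bbox area: ~21494, over 5x larger
```

A polygon box keeps the detection region tight around the rotated text, which is exactly the case where rectangular boxes fail.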
SOTA across five real-world scenarios
It outperforms existing models in all of the following scenarios:
- Scanned documents
- Tilted documents
- Folded documents
- Screen captures
- Poorly lit environments
Text spotting and seal recognition
Version 1.5 newly supports text spotting (line-by-line detection plus recognition) and seal recognition. Both categories achieved SOTA results.
Long-document support
It supports automatic merging of tables across pages and heading recognition, reducing content fragmentation when parsing long PDFs.
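The idea behind cross-page table merging can be pictured as stitching two fragments that share a header row. A toy sketch (my own illustration; the real pipeline works on the model's structured output, not raw markdown):

```python
def merge_table_fragments(page1_md, page2_md):
    """Join two markdown table fragments, dropping a repeated header on page 2."""
    rows1 = page1_md.strip().splitlines()
    rows2 = page2_md.strip().splitlines()
    header, divider = rows1[0], rows1[1]
    # Drop the header and divider if page 2 repeats them.
    while rows2 and rows2[0] in (header, divider):
        rows2.pop(0)
    return "\n".join(rows1 + rows2)

page1 = "| Item | Qty |\n|---|---|\n| Apples | 3 |"
page2 = "| Item | Qty |\n|---|---|\n| Pears | 5 |"
print(merge_table_fragments(page1, page2))
```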
Expanded multilingual coverage
The model expands from 109 languages in v1.0 to 111 languages in v1.5, adding Tibetan and Bengali. It also improves recognition of rare characters, old documents, multilingual tables, underlines, and checkboxes.
Benchmark Comparison
OmniDocBench v1.5 (scores for the v1.0 model)
| Model | Overall score |
|---|---|
| PaddleOCR-VL | 92.56 |
| MinerU2.5-1.2B | 90.67 |
| GPT-4o | Lower than above |
| Qwen2.5-VL-72B | Lower than above |
The v1.5 version is said to improve that score to 94.5%, although a detailed comparison against other models at the v1.5 stage has not yet been published.
Individual metrics (v1.0)
| Metric | Score |
|---|---|
| Text edit distance | 0.035 (lower is better) |
| Formula CDM | 91.43 |
| Table TEDS | 89.76 |
| Reading-order edit distance | 0.043 (lower is better) |
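The edit-distance metrics are Levenshtein distances normalized by the reference length, so 0.035 means roughly 3.5 characters changed per 100 characters of ground truth. A minimal implementation of the metric (my sketch; the benchmark tooling may differ in details such as tokenization):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    return levenshtein(pred, ref) / max(len(ref), 1)

print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR-VL"))           # 0.0
print(round(normalized_edit_distance("Padd1eOCR-VL", "PaddleOCR-VL"), 3)) # 0.083
```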
olmOCR-Bench
It achieved the best unit-test pass rate at 80.0%. In particular, it performed well on ArXiv documents (85.7%) and header/footer handling (97.0%).
Inference speed
Compared with MinerU2.5, page throughput is 15.8% higher and token throughput is 14.2% higher. GPU memory usage is almost identical (43.7GB vs. 41.9GB on an A100).
How to Use It
It is released under the Apache 2.0 license.
Installation
```shell
pip install paddlepaddle-gpu==3.2.1
pip install -U "paddleocr[doc-parser]"
```
CLI
```shell
paddleocr doc_parser -i document.png
```
Python API
```python
from paddleocr import PaddleOCRVL

# Build the document-parsing pipeline and run it on an image.
pipeline = PaddleOCRVL()
output = pipeline.predict("document.png")
for res in output:
    res.save_to_markdown(save_path="output")
```
HuggingFace Transformers
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": Image.open("document.png")},
    {"type": "text", "text": "OCR:"},
]}]
# tokenize=True / return_dict=True make apply_chat_template return
# model-ready tensors (including the processed image) in one call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```
It also supports a vLLM backend, which is recommended for large-scale document processing. AMD GPUs are supported from day one via ROCm 7.0.
Browser execution is still unrealistic
Even though it is lightweight at 0.9B parameters, running a VLM in the browser with WebGPU or WASM is still difficult. Even after quantization, you still need to download a model of around 500MB, and inference requires at least 1 to 2GB of VRAM. Because it assumes PaddlePaddle’s own framework, there is no mature path to ONNX or WebGPU conversion yet.
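The ~500MB figure follows directly from the parameter count: at 4-bit quantization, 0.9B parameters occupy about 0.45GB before any overhead. A back-of-the-envelope calculation:

```python
# Approximate on-disk size of 0.9B parameters at common quantization levels,
# ignoring format overhead (tokenizer, metadata, embeddings kept at higher precision).
params = 0.9e9
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: {gb:.2f} GB")
# 16-bit: 1.80 GB
# 8-bit:  0.90 GB
# 4-bit:  0.45 GB
```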
If you want to embed it in a web app, using it through an API is the realistic option. I covered the problems I ran into when trying browser OCR with PaddleJS in this article.