PaddleOCR-VL-1.5 - document parsing SOTA with only 0.9B parameters
On January 29, 2026, the PaddlePaddle team at Baidu released PaddleOCR-VL-1.5. This lightweight vision-language model with only 0.9B parameters achieved 94.5% accuracy on the document-parsing benchmark OmniDocBench v1.5, outperforming large models such as GPT-4o and Qwen2.5-VL-72B.
What PaddleOCR-VL Is
PaddleOCR is an open-source OCR toolkit developed by Baidu that supports more than 100 languages. PaddleOCR-VL is its successor, which uses a vision-language model (VLM) architecture for document parsing.
The architecture consists of two components:
- Visual encoder: a NaViT-style dynamic-resolution ViT that flexibly handles varying input image sizes
- Language model: ERNIE-4.5-0.3B
With a total of only 0.9B parameters, the model is compact enough to run on a single NVIDIA A100 GPU.
Main Improvements in v1.5
Irregular shape localization
It is the first model in the industry to support polygon detection boxes. It can accurately detect regions in tilted, folded, or curved documents, making it strong on real scanned documents where simple rectangular boxes fail.
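To see why axis-aligned rectangles break down on tilted pages, consider a text line rotated by 30 degrees: its axis-aligned bounding box covers several times the area of the text itself, pulling in neighboring content. A minimal geometric sketch (my own illustration, not the model's actual output format):

```python
import math

def shoelace_area(poly):
    """Area of a polygon [(x, y), ...] via the shoelace formula."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2

def axis_aligned_bbox_area(poly):
    """Area of the smallest axis-aligned rectangle enclosing the polygon."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def rotated_rect(w, h, angle_deg):
    """Corners of a w x h rectangle rotated by angle_deg around the origin."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a)
            for x, y in [(0, 0), (w, 0), (w, h), (0, h)]]

# A 200 x 20 px text line tilted by 30 degrees:
poly = rotated_rect(200, 20, 30)
print(f"{shoelace_area(poly):.1f}")           # text-line area: 4000.0
print(f"{axis_aligned_bbox_area(poly):.1f}")  # bbox area: ~21494, over 5x larger
```

A polygon box keeps the detection region tight around the rotated text, which is exactly the case where rectangular boxes fail.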
SOTA across five real-world scenarios
It outperforms existing models in all of the following scenarios:
- Scanned documents
- Tilted documents
- Folded documents
- Screen captures
- Poorly lit environments
Text spotting and seal recognition
Version 1.5 newly supports text spotting (line-by-line detection plus recognition) and seal recognition. Both categories achieved SOTA results.
Long-document support
It supports automatic merging of tables across pages and heading recognition, reducing content fragmentation when parsing long PDFs.
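The idea behind cross-page table merging can be pictured as stitching two fragments that share a header row. A toy sketch (my own illustration; the real pipeline works on the model's structured output, not raw markdown):

```python
def merge_table_fragments(page1_md, page2_md):
    """Join two markdown table fragments, dropping a repeated header on page 2."""
    rows1 = page1_md.strip().splitlines()
    rows2 = page2_md.strip().splitlines()
    header, divider = rows1[0], rows1[1]
    # Drop the header and divider if page 2 repeats them.
    while rows2 and rows2[0] in (header, divider):
        rows2.pop(0)
    return "\n".join(rows1 + rows2)

page1 = "| Item | Qty |\n|---|---|\n| Apples | 3 |"
page2 = "| Item | Qty |\n|---|---|\n| Pears | 5 |"
print(merge_table_fragments(page1, page2))
```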
Expanded multilingual coverage
The model expands from 109 languages in v1.0 to 111 languages in v1.5, adding Tibetan and Bengali. It also improves recognition of rare characters, old documents, multilingual tables, underlines, and checkboxes.
Benchmark Comparison
OmniDocBench v1.5 (scores for the v1.0 model)
| Model | Overall score |
|---|---|
| PaddleOCR-VL | 92.56 |
| MinerU2.5-1.2B | 90.67 |
| GPT-4o | Lower than above |
| Qwen2.5-VL-72B | Lower than above |
The v1.5 version is said to improve that score to 94.5%, although a detailed comparison against other models at the v1.5 stage has not yet been published.
Individual metrics (v1.0)
| Metric | Score |
|---|---|
| Text edit distance | 0.035 (lower is better) |
| Formula CDM | 91.43 |
| Table TEDS | 89.76 |
| Reading-order edit distance | 0.043 (lower is better) |
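The edit-distance metrics are Levenshtein distances normalized by the reference length, so 0.035 means roughly 3.5 characters changed per 100 characters of ground truth. A minimal implementation of the metric (my sketch; the benchmark tooling may differ in details such as tokenization):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    return levenshtein(pred, ref) / max(len(ref), 1)

print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR-VL"))           # 0.0
print(round(normalized_edit_distance("Padd1eOCR-VL", "PaddleOCR-VL"), 3)) # 0.083
```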
olmOCR-Bench
It achieved the best unit-test pass rate at 80.0%. In particular, it performed well on ArXiv documents (85.7%) and header/footer handling (97.0%).
Inference speed
Compared with MinerU2.5, page throughput is 15.8% higher and token throughput is 14.2% higher. GPU memory usage is almost identical (43.7GB vs. 41.9GB on an A100).
How to Use It
It is released under the Apache 2.0 license.
Installation
```shell
pip install paddlepaddle-gpu==3.2.1
pip install -U "paddleocr[doc-parser]"
```
CLI
```shell
paddleocr doc_parser -i document.png
```
Python API
```python
from paddleocr import PaddleOCRVL

# Build the document-parsing pipeline and run it on an image.
pipeline = PaddleOCRVL()
output = pipeline.predict("document.png")
for res in output:
    res.save_to_markdown(save_path="output")
```
HuggingFace Transformers
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": Image.open("document.png")},
    {"type": "text", "text": "OCR:"},
]}]
# tokenize=True / return_dict=True make apply_chat_template return
# model-ready tensors (including the processed image) in one call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```
It also supports a vLLM backend, which is recommended for large-scale document processing. AMD GPUs are supported from day one via ROCm 7.0.
Browser execution is still unrealistic
Even though it is lightweight at 0.9B parameters, running a VLM in the browser with WebGPU or WASM is still difficult. Even after quantization, you still need to download a model of around 500MB, and inference requires at least 1 to 2GB of VRAM. Because it assumes PaddlePaddle’s own framework, there is no mature path to ONNX or WebGPU conversion yet.
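The ~500MB figure follows directly from the parameter count: at 4-bit quantization, 0.9B parameters occupy about 0.45GB before any overhead. A back-of-the-envelope calculation:

```python
# Approximate on-disk size of 0.9B parameters at common quantization levels,
# ignoring format overhead (tokenizer, metadata, embeddings kept at higher precision).
params = 0.9e9
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: {gb:.2f} GB")
# 16-bit: 1.80 GB
# 8-bit:  0.90 GB
# 4-bit:  0.45 GB
```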
If you want to embed it in a web app, using it through an API is the realistic option. I covered the problems I ran into when trying browser OCR with PaddleJS in this article.