GLM-OCR (0.9B) sets a new SOTA for document parsing, so I checked columns, vertical text, and math support
Not long after writing about PaddleOCR-VL-1.5, the 0.9B parameter class moved again. GLM-OCR, released by Zhipu AI and the THUDM group at Tsinghua University, reached 94.62% on OmniDocBench v1.5.
In earlier posts, this blog has tried NDLOCR, Tesseract.js, PaddleOCR, and DeepSeek-OCR. This time I focused on the areas where OCR usually struggles: paragraph structure, vertical and horizontal text, and math.
What GLM-OCR is
The technical report was published on March 11, 2026.
| Item | Details |
|---|---|
| Parameters | 0.9B (CogViT 0.4B + GLM-0.5B) |
| Developers | Zhipu AI + Tsinghua THUDM |
| License | Code: Apache 2.0, model weights: MIT |
| Supported languages | 100+ including Japanese |
| Output formats | Markdown, JSON with coordinates, LaTeX |
| Usage | vLLM, SGLang, Ollama, HuggingFace Transformers, pip install glmocr |
The architecture has three pieces:
CogViT (vision encoder, 0.4B)
→ cross-modal connector (token downsampling)
→ GLM-0.5B (language decoder)
Its notable feature is Multi-Token Prediction, which averages 5.2 tokens per step and gives roughly a 50% throughput gain over standard autoregressive decoding.
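Those two numbers together imply something about per-step cost. A quick back-of-the-envelope check, assuming one unit of cost for a plain autoregressive step (an illustration, not figures from the report):

```python
# Rough arithmetic behind the reported MTP numbers (illustrative only).
tokens_per_step = 5.2   # average tokens accepted per MTP step, per the report
throughput_gain = 0.50  # ~50% faster than plain autoregressive decoding

# If plain decoding emits 1 token per step at cost 1, then an MTP step that
# yields 5.2 tokens but nets only a 1.5x speedup must itself cost about:
relative_step_cost = tokens_per_step / (1 + throughput_gain)
print(f"implied cost of one MTP step ≈ {relative_step_cost:.2f}x a plain step")
```

In other words, each MTP step is roughly 3.5x as expensive as a plain decoding step; the speedup comes from the accepted tokens outpacing that overhead.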
Training happens in four stages:
1. Pretrain the vision encoder on hundreds of billions of image-text pairs
2. Multimodal pretraining plus MTP adaptation
3. OCR-focused supervised fine-tuning for text, formulas, tables, and KIE
4. Reinforcement learning with GRPO
The fourth stage is the interesting one. Different rewards are used for different tasks: normalized edit distance for text, CDM for formulas, TEDS for tables, and field-level F1 for KIE. It also penalizes repetition and invalid JSON.
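The text reward is easy to picture. Here is a minimal sketch of a normalized-edit-distance reward as I would reconstruct it from the description; this is my own code, not the authors':

```python
# Sketch of a normalized-edit-distance reward for OCR text
# (my reconstruction of the idea in the report, not the authors' code).

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                              # deletion
                cur[j - 1] + 1,                           # insertion
                prev[j - 1] + (a[i - 1] != b[j - 1]),     # substitution
            )
        prev = cur
    return prev[n]

def text_reward(prediction: str, reference: str) -> float:
    """Reward = 1 - normalized edit distance; 1.0 means an exact match."""
    if not prediction and not reference:
        return 1.0
    ned = edit_distance(prediction, reference) / max(len(prediction), len(reference))
    return 1.0 - ned

print(text_reward("GLM-OCR", "GLM-OCR"))  # exact match → 1.0
```

The formula (CDM), table (TEDS), and KIE (field-level F1) rewards follow the same shape: a task-appropriate similarity score mapped into [0, 1].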
Benchmark comparison
On OmniDocBench v1.5, GLM-OCR slightly beats the model I covered in the PaddleOCR-VL-1.5 article.
| Model | Parameters | Overall |
|---|---|---|
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 |
| MinerU2.5 | 1.2B | 90.67 |
| Gemini-3 Pro | undisclosed | 90.33 |
| Qwen3-VL | 235B | 89.15 |
A 0.9B model outperforming a 235B model by more than five points is striking. Specialization still matters.
| Metric | GLM-OCR | PaddleOCR-VL-1.5 |
|---|---|---|
| Text Edit (lower is better) | 0.040 | 0.035 |
| Formula CDM | 93.90 | 94.21 |
| Table TEDS | 93.96 | - |
| Table TEDS-S | 96.39 | - |
| Reading Order Edit (lower is better) | 0.044 | - |
PaddleOCR-VL-1.5 is slightly better at raw text and formulas, while GLM-OCR is much stronger on tables, which is what pushes the total score ahead.
Paragraphs and layout parsing
One of OCR’s hardest problems is layout parsing. In my NDLOCR histogram article, a four-column vertical book layout forced me to fall back to PyMuPDF and histograms after Layout Parser failed.
GLM-OCR uses a two-stage pipeline:
```mermaid
graph TD
    A[Input image] --> B[PP-DocLayout-V3<br/>layout parsing]
    B --> C1[Paragraph]
    B --> C2[Table]
    B --> C3[Formula]
    B --> C4[Figure]
    B --> C5[Header / footer]
    C1 --> D[GLM-OCR Core<br/>parallel recognition]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D
    D --> E[Merge & Post Process<br/>restore reading order]
    E --> F[Structured output<br/>Markdown / JSON]
```
PP-DocLayout-V3 first splits the page into semantic regions. GLM-OCR then recognizes each region in parallel, and Merge & Post Process restores reading order before producing structured output.
That design reduces hallucinations, but if the first-stage layout detector is wrong, the error flows through the entire pipeline. The report lists that as a known limitation.
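The second stage can be sketched with a thread pool: each detected region is recognized independently, then results are stitched back in reading order. Everything below (the region dicts, the `recognize` stub) is a stand-in for illustration, not the real GLM-OCR API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions from a first-stage layout detector, each carrying
# a reading-order index assigned during layout parsing.
regions = [
    {"order": 0, "type": "paragraph", "crop": "crop_0"},
    {"order": 2, "type": "formula",   "crop": "crop_2"},
    {"order": 1, "type": "table",     "crop": "crop_1"},
]

def recognize(region: dict) -> dict:
    # Stand-in for per-region recognition; a real pipeline would call the
    # model here with a type-specific prompt on the cropped image.
    return {"order": region["order"], "text": f"<{region['type']} text>"}

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(recognize, regions))

# Merge step: restore reading order before emitting structured output.
document = [r["text"] for r in sorted(results, key=lambda r: r["order"])]
print(document)
```

The merge step is also where the pipeline's fragility lives: if the layout stage mis-assigns a region or its reading-order index, no amount of per-region accuracy can fix the final document.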
Vertical and horizontal text
The report barely discusses vertical text directly.
The model card mentions support for “diverse text orientations,” and the PaddleOCR pipeline includes orientation-classification options. Japanese and Chinese are both supported.
But the internal evaluation shows the multilingual scenario scoring only 69.3, well below categories such as receipt KIE and tables. That suggests CJK text and vertical layouts may still be a weak point.
Compared with NDLOCR, which was designed by the National Diet Library of Japan and is very strong at vertical Japanese text, GLM-OCR seems more capable at structure understanding than at the fine-grained quirks of vertical layout.
Math recognition
Math is one of GLM-OCR’s strongest areas.
| Benchmark | Score |
|---|---|
| UniMERNet | 96.5 (SOTA) |
| Formula CDM (OmniDocBench) | 93.90 |
| OlmOCR-Bench Arxiv Math | 80.7% |
It outputs formulas in LaTeX and handles fractions, subscripts, superscripts, matrices, determinants, and other 2D notations with high accuracy. The report switches into formula mode with the prompt "Formula Recognition:".
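As an illustration of the kind of LaTeX such a model emits (my own example, not output copied from GLM-OCR), a determinant and a fraction would typically come back as something like:

```latex
% A 2x2 determinant and the quadratic formula, as typical LaTeX output
% (illustrative; actual model output formatting may differ).
\det A = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc,
\qquad
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
```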
As with other VLM-based OCR systems, the image is understood as a whole, which is exactly why math recognition works better than with older pattern-based OCR.
The report also says that highly complex mathematical expressions can still fail, so this is not a universal solution.
Other recognition abilities
The model also covers a broad set of edge cases.
| Category | Score |
|---|---|
| Table recognition (TEDS-S) | 96.39 |
| Header / footer | 95.8% |
| Receipt KIE | 94.5 |
| Seal recognition | 90.5 |
| Baseline | 98.8% |
| Long fine text | 86.9% |
| Handwriting | 87.0% |
| Code documents | 84.7% |
| Multi-column layouts | 76.7% |
| Old scans | 37.6% |
Old scans are still weak at 37.6%, which means degraded paper and faded print remain a challenge. Given that NDLOCR was built for old National Diet Library material, there are still cases where a more traditional OCR stack may be a better fit.
How to try it
The easiest way is Ollama:
```bash
ollama run glm-ocr
```
There is also a Python SDK:
```bash
pip install glmocr
```
It also works through vLLM, SGLang, and HuggingFace Transformers. At 0.9B parameters, it is small enough that edge deployment is still on the table.
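Assuming the Ollama tag above works, a page image can be sent through Ollama's documented `POST /api/generate` endpoint. The endpoint and fields are real Ollama API; the model name and the `"Formula Recognition:"` prompt come from this post, so treat them as assumptions. Here I only build and print the payload, since sending it requires a running Ollama server:

```python
import base64

def build_ocr_request(image_bytes: bytes, prompt: str = "Text Recognition:") -> dict:
    """Build a JSON payload for Ollama's POST /api/generate endpoint."""
    return {
        "model": "glm-ocr",  # tag assumed from `ollama run glm-ocr` above
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,     # ask for a single JSON response, not a stream
    }

# Build (but do not send) a formula-mode request for a page image.
payload = build_ocr_request(b"\x89PNG...", prompt="Formula Recognition:")
print(payload["model"], payload["prompt"])
```

POST the payload as JSON to `http://localhost:11434/api/generate`; the recognized text comes back in the `response` field of the reply.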
When to use VLM OCR versus traditional OCR
Based on all the OCR approaches I have tried in this blog, the rough split looks like this:
| Use case | Recommendation |
|---|---|
| High-accuracy document parsing, especially tables and math | GLM-OCR / PaddleOCR-VL-1.5 |
| Vertical Japanese text and old books | NDLOCR / NDLOCR-Lite |
| In-browser real-time OCR | Tesseract.js |
| OCR post-processing and correction | Encoder model + LLM |
| Bulk PDF processing | DeepSeek-OCR-2 |
| On-device mobile OCR | NDLOCR-Lite + ONNX Runtime |
VLM OCR is excellent at structure understanding and multitask recognition, but niche cases such as vertical text, old scans, and browser execution still give traditional OCR an edge. We are not at the point where one VLM can replace everything.
The 69.3 multilingual score still bothers me. I need to run my own Japanese vertical-text samples through Ollama to see how far it really goes.