GLM-OCR (0.9B) sets a new SOTA for document parsing, so I checked columns, vertical text, and math support
Not long after writing about PaddleOCR-VL-1.5, the 0.9B parameter class moved again. GLM-OCR, released by Zhipu AI and the THUDM group at Tsinghua University, reached 94.62% on OmniDocBench v1.5.
In earlier posts, this blog has tried NDLOCR, Tesseract.js, PaddleOCR, and DeepSeek-OCR. This time I focused on the areas where OCR usually struggles: paragraph structure, vertical and horizontal text, and math.
What GLM-OCR is
The technical report was published on March 11, 2026.
| Item | Details |
|---|---|
| Parameters | 0.9B (CogViT 0.4B + GLM-0.5B) |
| Developers | Zhipu AI + Tsinghua THUDM |
| License | Code: Apache 2.0, model weights: MIT |
| Supported languages | 100+ including Japanese |
| Output formats | Markdown, JSON with coordinates, LaTeX |
| Usage | vLLM, SGLang, Ollama, HuggingFace Transformers, pip install glmocr |
The architecture has three pieces:
CogViT (vision encoder, 0.4B)
→ cross-modal connector (token downsampling)
→ GLM-0.5B (language decoder)
Its notable feature is Multi-Token Prediction, which averages 5.2 tokens per step and gives roughly a 50% throughput gain over standard autoregressive decoding.
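Those two numbers together imply something about per-step cost. A quick back-of-the-envelope check, assuming one unit of cost for a plain autoregressive step (an illustration, not figures from the report):

```python
# Rough arithmetic behind the reported MTP numbers (illustrative only).
tokens_per_step = 5.2   # average tokens accepted per MTP step, per the report
throughput_gain = 0.50  # ~50% faster than plain autoregressive decoding

# If plain decoding emits 1 token per step at cost 1, then an MTP step that
# yields 5.2 tokens but nets only a 1.5x speedup must itself cost about:
relative_step_cost = tokens_per_step / (1 + throughput_gain)
print(f"implied cost of one MTP step ≈ {relative_step_cost:.2f}x a plain step")
```

In other words, each MTP step is roughly 3.5x as expensive as a plain decoding step; the speedup comes from the accepted tokens outpacing that overhead.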
Training happens in four stages:
1. Pretrain the vision encoder on hundreds of billions of image-text pairs
2. Multimodal pretraining plus MTP adaptation
3. OCR-focused supervised fine-tuning for text, formulas, tables, and KIE
4. Reinforcement learning with GRPO
The fourth stage is the interesting one. Different rewards are used for different tasks: normalized edit distance for text, CDM for formulas, TEDS for tables, and field-level F1 for KIE. It also penalizes repetition and invalid JSON.
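The text reward is easy to picture. Here is a minimal sketch of a normalized-edit-distance reward as I would reconstruct it from the description; this is my own code, not the authors':

```python
# Sketch of a normalized-edit-distance reward for OCR text
# (my reconstruction of the idea in the report, not the authors' code).

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                              # deletion
                cur[j - 1] + 1,                           # insertion
                prev[j - 1] + (a[i - 1] != b[j - 1]),     # substitution
            )
        prev = cur
    return prev[n]

def text_reward(prediction: str, reference: str) -> float:
    """Reward = 1 - normalized edit distance; 1.0 means an exact match."""
    if not prediction and not reference:
        return 1.0
    ned = edit_distance(prediction, reference) / max(len(prediction), len(reference))
    return 1.0 - ned

print(text_reward("GLM-OCR", "GLM-OCR"))  # exact match → 1.0
```

The formula (CDM), table (TEDS), and KIE (field-level F1) rewards follow the same shape: a task-appropriate similarity score mapped into [0, 1].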
Benchmark comparison
On OmniDocBench v1.5, GLM-OCR slightly beats the model I covered in the PaddleOCR-VL-1.5 article.
| Model | Parameters | Overall |
|---|---|---|
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 |
| MinerU2.5 | 1.2B | 90.67 |
| Gemini-3 Pro | undisclosed | 90.33 |
| Qwen3-VL | 235B | 89.15 |
A 0.9B model outperforming a 235B model by more than five points is striking. Specialization still matters.
| Metric | GLM-OCR | PaddleOCR-VL-1.5 |
|---|---|---|
| Text Edit (lower is better) | 0.040 | 0.035 |
| Formula CDM | 93.90 | 94.21 |
| Table TEDS | 93.96 | - |
| Table TEDS-S | 96.39 | - |
| Reading Order Edit (lower is better) | 0.044 | - |
PaddleOCR-VL-1.5 is slightly better at raw text and formulas, while GLM-OCR is much stronger on tables, which is what pushes the total score ahead.
Paragraphs and layout parsing
One of OCR’s hardest problems is layout parsing. In my NDLOCR histogram article, a four-column vertical book layout forced me to fall back to PyMuPDF and histograms after Layout Parser failed.
GLM-OCR uses a two-stage pipeline:
```mermaid
graph TD
    A[Input image] --> B[PP-DocLayout-V3<br/>layout parsing]
    B --> C1[Paragraph]
    B --> C2[Table]
    B --> C3[Formula]
    B --> C4[Figure]
    B --> C5[Header / footer]
    C1 --> D[GLM-OCR Core<br/>parallel recognition]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D
    D --> E[Merge & Post Process<br/>restore reading order]
    E --> F[Structured output<br/>Markdown / JSON]
```
PP-DocLayout-V3 first splits the page into semantic regions. GLM-OCR then recognizes each region in parallel, and Merge & Post Process restores reading order before producing structured output.
That design reduces hallucinations, but if the first-stage layout detector is wrong, the error flows through the entire pipeline. The report lists that as a known limitation.
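The second stage can be sketched with a thread pool: each detected region is recognized independently, then results are stitched back in reading order. Everything below (the region dicts, the `recognize` stub) is a stand-in for illustration, not the real GLM-OCR API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions from a first-stage layout detector, each carrying
# a reading-order index assigned during layout parsing.
regions = [
    {"order": 0, "type": "paragraph", "crop": "crop_0"},
    {"order": 2, "type": "formula",   "crop": "crop_2"},
    {"order": 1, "type": "table",     "crop": "crop_1"},
]

def recognize(region: dict) -> dict:
    # Stand-in for per-region recognition; a real pipeline would call the
    # model here with a type-specific prompt on the cropped image.
    return {"order": region["order"], "text": f"<{region['type']} text>"}

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(recognize, regions))

# Merge step: restore reading order before emitting structured output.
document = [r["text"] for r in sorted(results, key=lambda r: r["order"])]
print(document)
```

The merge step is also where the pipeline's fragility lives: if the layout stage mis-assigns a region or its reading-order index, no amount of per-region accuracy can fix the final document.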
Vertical and horizontal text
The report barely discusses vertical text directly.
The model card mentions support for “diverse text orientations,” and the PaddleOCR pipeline includes orientation-classification options. Japanese and Chinese are both supported.
But the internal evaluation shows the multilingual scenario scoring only 69.3, well below categories such as receipt KIE and tables. That suggests CJK text and vertical layouts may still be a weak point.
Compared with NDLOCR, which was designed by the National Diet Library of Japan and is very strong at vertical Japanese text, GLM-OCR seems more capable at structure understanding than at the fine-grained quirks of vertical layout.
Math recognition
Math is one of GLM-OCR’s strongest areas.
| Benchmark | Score |
|---|---|
| UniMERNet | 96.5 (SOTA) |
| Formula CDM (OmniDocBench) | 93.90 |
| OlmOCR-Bench Arxiv Math | 80.7% |
It outputs formulas in LaTeX and handles fractions, subscripts, superscripts, matrices, determinants, and other 2D notations with high accuracy. The report switches into formula mode with the prompt "Formula Recognition:".
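As an illustration of the kind of LaTeX such a model emits (my own example, not output copied from GLM-OCR), a determinant and a fraction would typically come back as something like:

```latex
% A 2x2 determinant and the quadratic formula, as typical LaTeX output
% (illustrative; actual model output formatting may differ).
\det A = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc,
\qquad
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
```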
As with other VLM-based OCR systems, the image is understood as a whole, which is exactly why math recognition works better than with older pattern-based OCR.
The report also says that highly complex mathematical expressions can still fail, so this is not a universal solution.
Other recognition abilities
The model also covers a broad set of edge cases.
| Category | Score |
|---|---|
| Table recognition (TEDS-S) | 96.39 |
| Header / footer | 95.8% |
| Receipt KIE | 94.5 |
| Seal recognition | 90.5 |
| Baseline | 98.8% |
| Long fine text | 86.9% |
| Handwriting | 87.0% |
| Code documents | 84.7% |
| Multi-column layouts | 76.7% |
| Old scans | 37.6% |
Old scans are still weak at 37.6%, which means degraded paper and faded print remain a challenge. Given that NDLOCR was built for old National Diet Library material, there are still cases where a more traditional OCR stack may be a better fit.
How to try it
The easiest way is Ollama:
```bash
ollama run glm-ocr
```
There is also a Python SDK:
```bash
pip install glmocr
```
It also works through vLLM, SGLang, and HuggingFace Transformers. At 0.9B parameters, it is small enough that edge deployment is still on the table.
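Assuming the Ollama tag above works, a page image can be sent through Ollama's documented `POST /api/generate` endpoint. The endpoint and fields are real Ollama API; the model name and the `"Formula Recognition:"` prompt come from this post, so treat them as assumptions. Here I only build and print the payload, since sending it requires a running Ollama server:

```python
import base64

def build_ocr_request(image_bytes: bytes, prompt: str = "Text Recognition:") -> dict:
    """Build a JSON payload for Ollama's POST /api/generate endpoint."""
    return {
        "model": "glm-ocr",  # tag assumed from `ollama run glm-ocr` above
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,     # ask for a single JSON response, not a stream
    }

# Build (but do not send) a formula-mode request for a page image.
payload = build_ocr_request(b"\x89PNG...", prompt="Formula Recognition:")
print(payload["model"], payload["prompt"])
```

POST the payload as JSON to `http://localhost:11434/api/generate`; the recognized text comes back in the `response` field of the reply.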
When to use VLM OCR versus traditional OCR
Based on all the OCR approaches I have tried in this blog, the rough split looks like this:
| Use case | Recommendation |
|---|---|
| High-accuracy document parsing, especially tables and math | GLM-OCR / PaddleOCR-VL-1.5 |
| Vertical Japanese text and old books | NDLOCR / NDLOCR-Lite |
| In-browser real-time OCR | Tesseract.js |
| OCR post-processing and correction | Encoder model + LLM |
| Bulk PDF processing | DeepSeek-OCR-2 |
| On-device mobile OCR | NDLOCR-Lite + ONNX Runtime |
VLM OCR is excellent at structure understanding and multitask recognition, but niche cases such as vertical text, old scans, and browser execution still give traditional OCR an edge. We are not at the point where one VLM can replace everything.
The 69.3 multilingual score still bothers me. I need to run my own Japanese vertical-text samples through Ollama to see how far it really goes.